R Programming Tips

1 Data Types and Basic Operations

R has five basic or “atomic” classes of objects:

  • character
  • numeric (real numbers)
  • integer
  • complex
  • logical (True/False)

Vector and Matrix

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
vector()
x <- vector("numeric", length = 10)

c()
x <- c("a", "b", "c")
x <- c(TRUE, FALSE)
as.numeric(x)
as.logical(x)
as.character(x)

attributes() 
print()

m <- matrix(nrow = 2, ncol = 3)
m <- matrix(1:6, nrow = 2, ncol = 3)

m<-1:10
dim(m) <- c(2, 5)

cbind(x, y)
rbind(x, y)

list

Lists are a special type of vector that can contain elements of different classes.

1
x <- list(1, "a", TRUE, 1 + 4i)

Factor

Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.

1
2
3
4
5
6
7
lm()
glm()

x <- factor(c("yes", "yes", "no", "yes", "no"))
table(x)

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))

Missing Values

Missing values are denoted by NA or NaN for undefined mathematical operations.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
x <- c(1,2,NaN,NA,4)
is.na(x)
is.nan(x)

x<-c(1,2,NA,4,NA,5)
bad <- is.na(x)
x[!bad]

#### What if there are multiple things and you want to take the subset with no missing values?
x<-c(1,2,NA,4,NA,5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)
x[good]
y[good]

df[1:6,]
good <- complete.cases(df)
df[good,][1:6,]

x <- df[df$Month==5,]
summary(x$Ozone)

Data frames are used to store tabular data

1
2
3
4
5
6
row.names
read.table()
read.csv()
data.matrix()

x <- data.frame(foo = 1:4, bar = c(T, T, F, F))

Names

1
2
3
4
5
6
7
8
9
x<-1:3
names(x)
names(x) <- c("foo", "bar", "norf")
x

x<-list(a=1,b=2,c=3)

m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))

2 Deal with data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
read.table()    #for reading tabular data
read.csv()
write.table()

readLines()     #for reading lines of a text file
writeLines()

source()        #for reading in R code files
dump()

dump(c("x", "y"), file = "data.R")
rm(x, y)
source("data.R")

dget()          #for reading in R code files 
dput()

load()          #for reading in saved workspaces
save()

unserialize()   #for reading single R objects in binary form
serialize()

Data are read in using connection interfaces.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
file     #opens a connection to a file
gzfile   #opens a connection to a file compressed with gzip
bzfile   #opens a connection to a file compressed with bzip2
url      #opens a connection to a webpage

str(file)
con <- file("foo.txt", "r")
data <- read.csv(con)
close(con)

con <- gzfile("words.gz")
x <- readLines(con, 10)

con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)

args(paste)
function (..., sep = " ", collapse = NULL)

I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?

1
1,500,000 × 120 × 8 bytes/numeric = 1.34 GB

3 Control structures

  • if, else: testing a condition
  • for: execute a loop a fixed number of times
  • while: execute a loop while a condition is true · repeat: execute an infinite loop
  • break: break the execution of a loop
  • next: skip an interation of a loop
  • return: exit a function

4 Looping on the Command Line

  • lapply: Loop over a list and evaluate a function on each element ·
  • sapply: Same as lapply but try to simplify the result
  • apply: Apply a function over the margins of an array
  • tapply: Apply a function over subsets of a vector
  • mapply: Multivariate version of lapply
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)  #

x<-1:4
lapply(x, runif, min = 0, max = 10)

##### lapply and friends make heavy use of anonymous functions.
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
lapply(x, function(elt) elt[,1])

##### sapply will try to simplify the result of lapply if possible.
##### If the result is a list where every element is length 1, then a vector is returned
##### If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
##### If it can’t figure things out, a list is returned.
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
sapply(x, mean)

applay

apply is used to a evaluate a function (often an anonymous one) over the margins of an array.

  • It is most often used to apply a function to the rows or columns of a matrix.
  • It can be used with general arrays, e.g. taking the average of an array of matrices
  • It is not really faster than writing a loop, but it works in one line!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
str(apply)
x <- matrix(rnorm(200), 20, 10)
apply(x, 2, mean)
apply(x, 1, sum)

rowSums = apply(x, 1, sum)
rowMeans = apply(x, 1, mean)
colSums = apply(x, 2, sum)
colMeans = apply(x, 2, mean)

x <- matrix(rnorm(200), 20, 10)
apply(x, 1, quantile, probs = c(0.25, 0.75))

a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
apply(a, c(1, 2), mean)
rowMeans(a, dims = 2)

tapply

tpply is used to apply a function over subsets of a vector.

1
2
3
4
5
6
7
8
9
str(tapply)

x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
tapply(x, f, mean)


with(mtcars, tapply(hp, cyl, mean))
sapply(split(mtcars$hp, mtcars$cyl), mean)

split

split takes a vector or other objects and splits it into groups determined by a factor or list of factors.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
str(split)
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
split(x, f)
lapply(split(x, f), mean)

##### splitting a Data Frame
library(datasets)
head(airquality)
s <- split(airquality, airquality$Month)
lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))

##### split more than one level
x <- rnorm(10)
f1<-gl(2,5)
f2<-gl(5,2)
interaction(f1, f2)
str(split(x, list(f1, f2)))
str(split(x, list(f1, f2), drop = TRUE))

maaply

mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments.

1
2
3
4
str(mapply)

list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
mapply(rep, 1:4, 4:1)

reshape

1
2
3
4
5
6
7
library("reshape2")

##### metlting data frames
melt(mtcars, id=c("carname","gear","cyl"), measure.vars=c("mpg","hp"))

##### casting data frames
dcast(carMelt, cyl ~ variable, mean)

plyr

1
ddply()

str

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
str(str)
str(lm)
str(ls)

args(paste)
function (..., sep = " ", collapse = NULL)

x <- rnrom(100,2,4)
summary(x)
str(x)

f <- gl(40,10)
str(f)
summary(f)

### split()
### split takes a vector or other objects and splits it into groups determined by a factor or list of factors.
str(split)
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
split(x, f)

### A common idiom is split followed by an lapply
lapply(split(x, f), mean)

5 Generating random Numbers

  • d for density ->pdf
  • r for random number generation
  • p for cumulative distribution -> cdf
  • q for quantile function
1
2
3
4
5
6
7
8
9
10
11
12
dnorm(x, mean = 0, sd = 1, log = FALSE)
##### pnorm(q) = fi(q); qnorm(p) = fi(q)反函数
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

x <- rnorm(10,20,10)
summary(x)

### Setting the random number seed with set.seed ensures reproducibility
### Always set the random number seed when conducting a simulation!
set.seed(1)

Bayesian Genome Assembly and MCMC Assessment

Introduction

They first build an assembly graph starting from a de Bruijn graph of the reads. Then they remove all tips and merge all unambiguous paths into single nodes that are annotated by the sequence of merged K-mers.

The resulting unresolved assembly graph (no longer de Bruijn) is a directed graph that consists only of bubbles and is a minimal representation of the variants that can be inferred from the sequenced data. Concatenating the sequences across the nodes in a particular path through this graph gives a possible assembly sequence.

Read More

Python in a Nutshell (Network Tips)

The urlparse Module

1
2
3
urljoin(base_url_string,relative_url_string)
urlsplit(url_string,default_scheme='',allow_fragments=True)
urlunsplit(url_tuple)

The urllib Module

1
2
3
4
5
6
7
8
quote(str,safe='/')
quote_plus(str, safe='/')
unquote(str)
unquote_plus(str)
urlcleanup( )
urlencode(query,doseq=False)
urlopen(urlstring,data=None,proxies=None)
urlretrieve(urlstring,filename=None,reporthook=None,data=None)

Read More

Python in a Nutshell (Thread Tips)

A thread is a flow of control that shares global state with other threads; all threads appear to execute simultaneously.

A process is an instance of a running program.

创建线程的两种方法

  • 继承Thread类, 重写run()方法, 而不是start()
  • 创建threading.Thread对象,初始化函数__init__()中可将调用对象作为参数传入

The thread Module

1
2
3
L.acquire(wait=True) # When wait is True, acquire locks L.
L.locked( )          # Returns True if L is locked; otherwise, False.
L.release( )         # Unlocks L, which must be locked.

The Queue Module

The Queue module supplies first-in, first-out (FIFO) queues that support multithread access, with one main class and two exception classes.

1
2
3
4
5
6
7
8
class Queue(maxsize=0)
q.empty()
q.get(False)
q.full()
q.put(x,False)
q.get(block=True,timeout=None)
q.put(item,block=True,timeout=None)
q.qsize( )

Read More

Python in a Nutshell (Testing Tips)

Testing

Unit Testing and System Testing

Unit testing means writing and running tests to exercise a single module or an even smaller unit, such as a class or function.

System testing (also known as functional or integration testing) involves running an entire program with known inputs.

Some classic books on testing also draw the distinction between white-box testing, done with knowledge of a program’s internals, and black-box testing, done without such knowledge.

The doctest Module

is_palindrome.py

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
def is_palindrome(s):
	''' (str) -> bool
	Return Ture if and only if is a palindrome
	>>> is_palindrome('noon')
	True
	>>> is_palindrome('people')
	False
	>>> is_palindrome('catac')
	True
	'''
	
	i = 0
	j = len(s) - 1
	while (i<j) and (s[i] == s[j]):
		i = i + 1
		j = j - 1
	return j <= i

if __name__ == '__main__': 
    import doctest
    doctest.testmod( )

Read More

Python in a Nutshell (IO Tips)

File and Text Operations

1
open(filename, mode='r', bufsize=-1)

File mode

Mode Description
‘r’ The file must already exist, and it is opened in read-only mode.
‘w’ The file is opened in write-only mode. Overwritten or created.
‘a’ The file is opened in write-only mode. Appended or created.
‘r+’ The file must already exist and is opened for both reading and writing.
‘w+’ The file is opened for both reading and writing. Overwritten or created.
‘a+’ The file is opened for both reading and writing.Appended or created.
b Binary.
t Text.

Read More

SSPACE-LongRead: scaffolding with long reads

Introduction

We are happy to say that SSPACE is ready for dealing with the PacBio long reads for scaffolding[^1].

They proposed a novel hybrid assembly methodology that aims to scaffold pre-assembled contigs in an iterative manner using PacBio RS long read information as a backbone.

The SSPACE-LongRead software which is designed to upgrade incomplete draft genomes using single molecule sequences. We conclude that the recent advances of the PacBio sequencing technology and chemistry, in combination with the limited computational resources required to run our program, allow to scaffold genomes in a fast and reliable manner.

Read More

Data Analysis with Python (Useful Tips)

复制与引用

1
2
3
4
5
6
7
8
9
10
# a = b 赋值时,创建对象的新引用
# 不可变对象(数字和字符串),创建副本
# 可变对象(list 和 dict),创建引用,行为会有变化,危险
# 浅复制
a = [1,2,3,4]
b = list(a) # 共有元素部分会发生关联,危险

# 深复制
import copy
b = copy.deepcopy(a)
1
2
3
4
line        = "GOOD,100,490.10"
types       = [str,int,float]
raw_fields  = line.split(',')
fields      = [ty(vl) for ty,vl in zip(types,raw_field)]

collections

1
2
3
4
5
from collections import defaultdict
counts = defaultdict(int) # values will initialize to 0

from collections import Counter
counts = Counter(list)   # list 频数统计

pandas

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from pandas import DataFrame, Series
import pandas as pd
frame = DataFrame(records)
results = Series([x.split()[0] for x in frame.a.dropna()])
results.value_counts()[:8]

import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip'] 
users = pd.read_table('ml-1m/users.dat', sep='::', header=None,
names=unames)

data = pd.merge(pd.merge(ratings, users), movies)

names1880 = pd.read_csv('names/yob1880.txt', names=['name', 'sex', 'births'])
names1880.groupby('sex').births.sum()

numpy

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
import numpy as np
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2.ndim
arr2.shape
arr2.dtype


np.zeros(10)
np.zeros((2,3))
np.arange(15)

### Linear Algebra
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x.dot(y) # equivalently np.dot(x, y)

from numpy.linalg import inv, qr
X = randn(5, 5)
mat = X.T.dot(X)

inv(mat)
q, r = qr(mat)

pandas

1
2
3
4
5
6
7
8
9
10
11
12
13
import pandas as pd
#### Series
pd.Series([4, 7, -5, 3])
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2 = pd.Series(adict)

#### DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])

Mysql Notes

基本点

database
table
column
datatype
row
primary key(unique and non-null)
(不更新主键列中的值;不重用主键列的值;不在主键列中使用可能会更改的值;)

SQL(Structured Query Language)

MySQL Administrator
MySQL Query Browser

初见

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
mysql -h 110.110.110.110 -uroot -p abcd123

mysqladmin -uroot -password ab12

grant select,insert,update,
delete on *.* to [email=test2@localhost][color=#355e9e]test2@localhost[/color][/email] identified by \"abc\";

create/drop database database-name;
show databases;
use database-name;
show tables;

show columns from table-name; <=> describe table-name;

show grants;    show status;    show errors;

Read More