Jul 7 2014

R

R Programming Tips

1 Data Types and Basic Operations

R has five basic or “atomic” classes of objects:

character
numeric (real numbers)
integer
complex
logical (True/False)

Vector and Matrix

vector()
x <- vector("numeric", length = 10)

c()
x <- c("a", "b", "c")
x <- c(TRUE, FALSE)
as.numeric(x)
as.logical(x)
as.character(x)

attributes() 
print()

m <- matrix(nrow = 2, ncol = 3)
m <- matrix(1:6, nrow = 2, ncol = 3)

m<-1:10
dim(m) <- c(2, 5)

cbind(x, y)
rbind(x, y)

list

Lists are a special type of vector that can contain elements of different classes.

1	x <- list(1, "a", TRUE, 1 + 4i)

Factor

Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.

lm()
glm()

x <- factor(c("yes", "yes", "no", "yes", "no"))
table(x)

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))

Missing Values

Missing values are denoted by NA or NaN for undefined mathematical operations.

x <- c(1,2,NaN,NA,4)
is.na(x)
is.nan(x)

x<-c(1,2,NA,4,NA,5)
bad <- is.na(x)
x[!bad]

#### What if there are multiple things and you want to take the subset with no missing values?
x<-c(1,2,NA,4,NA,5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)
x[good]
y[good]

df[1:6,]
good <- complete.cases(df)
df[good,][1:6,]

x <- df[df$Month==5,]
summary(x$Ozone)

Data frames are used to store tabular data

row.names
read.table()
read.csv()
data.matrix()

x <- data.frame(foo = 1:4, bar = c(T, T, F, F))

Names

x<-1:3
names(x)
names(x) <- c("foo", "bar", "norf")
x

x<-list(a=1,b=2,c=3)

m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))

2 Deal with data

read.table()    #for reading tabular data
read.csv()
write.table()

readLines()     #for reading lines of a text file
writeLines()

source()        #for reading in R code files
dump()

dump(c("x", "y"), file = "data.R")
rm(x, y)
source("data.R")

dget()          #for reading in R code files 
dput()

load()          #for reading in saved workspaces
save()

unserialize()   #for reading single R objects in binary form
serialize()

Data are read in using connection interfaces.

file     #opens a connection to a file
gzfile   #opens a connection to a file compressed with gzip
bzfile   #opens a connection to a file compressed with bzip2
url      #opens a connection to a webpage

str(file)
con <- file("foo.txt", "r")
data <- read.csv(con)
close(con)

con <- gzfile("words.gz")
x <- readLines(con, 10)

con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)

args(paste)
function (..., sep = " ", collapse = NULL)

I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?

1	1,500,000 × 120 × 8 bytes/numeric = 1.34 GB

3 Control structures

if, else: testing a condition
for: execute a loop a fixed number of times
while: execute a loop while a condition is true · repeat: execute an infinite loop
break: break the execution of a loop
next: skip an interation of a loop
return: exit a function

4 Looping on the Command Line

lapply: Loop over a list and evaluate a function on each element ·
sapply: Same as lapply but try to simplify the result
apply: Apply a function over the margins of an array
tapply: Apply a function over subsets of a vector
mapply: Multivariate version of lapply

x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)  #

x<-1:4
lapply(x, runif, min = 0, max = 10)

##### lapply and friends make heavy use of anonymous functions.
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
lapply(x, function(elt) elt[,1])

##### sapply will try to simplify the result of lapply if possible.
##### If the result is a list where every element is length 1, then a vector is returned
##### If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
##### If it can’t figure things out, a list is returned.
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
sapply(x, mean)

applay

apply is used to a evaluate a function (often an anonymous one) over the margins of an array.

It is most often used to apply a function to the rows or columns of a matrix.
It can be used with general arrays, e.g. taking the average of an array of matrices
It is not really faster than writing a loop, but it works in one line!

str(apply)
x <- matrix(rnorm(200), 20, 10)
apply(x, 2, mean)
apply(x, 1, sum)

rowSums = apply(x, 1, sum)
rowMeans = apply(x, 1, mean)
colSums = apply(x, 2, sum)
colMeans = apply(x, 2, mean)

x <- matrix(rnorm(200), 20, 10)
apply(x, 1, quantile, probs = c(0.25, 0.75))

a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
apply(a, c(1, 2), mean)
rowMeans(a, dims = 2)

tapply

tpply is used to apply a function over subsets of a vector.

str(tapply)

x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
tapply(x, f, mean)


with(mtcars, tapply(hp, cyl, mean))
sapply(split(mtcars$hp, mtcars$cyl), mean)

split

split takes a vector or other objects and splits it into groups determined by a factor or list of factors.

str(split)
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
split(x, f)
lapply(split(x, f), mean)

##### splitting a Data Frame
library(datasets)
head(airquality)
s <- split(airquality, airquality$Month)
lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))

##### split more than one level
x <- rnorm(10)
f1<-gl(2,5)
f2<-gl(5,2)
interaction(f1, f2)
str(split(x, list(f1, f2)))
str(split(x, list(f1, f2), drop = TRUE))

maaply

mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments.

str(mapply)

list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
mapply(rep, 1:4, 4:1)

reshape

library("reshape2")

##### metlting data frames
melt(mtcars, id=c("carname","gear","cyl"), measure.vars=c("mpg","hp"))

##### casting data frames
dcast(carMelt, cyl ~ variable, mean)

plyr

ddply()

str

str(str)
str(lm)
str(ls)

args(paste)
function (..., sep = " ", collapse = NULL)

x <- rnrom(100,2,4)
summary(x)
str(x)

f <- gl(40,10)
str(f)
summary(f)

### split()
### split takes a vector or other objects and splits it into groups determined by a factor or list of factors.
str(split)
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
split(x, f)

### A common idiom is split followed by an lapply
lapply(split(x, f), mean)

5 Generating random Numbers

d for density ->pdf
r for random number generation
p for cumulative distribution -> cdf
q for quantile function

dnorm(x, mean = 0, sd = 1, log = FALSE)
##### pnorm(q) = fi(q); qnorm(p) = fi(q)反函数
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

x <- rnorm(10,20,10)
summary(x)

### Setting the random number seed with set.seed ensures reproducibility
### Always set the random number seed when conducting a simulation!
set.seed(1)

Jul 2 2014

Bioinformatics►ML

Bayesian Genome Assembly and MCMC Assessment

Introduction

They first build an assembly graph starting from a de Bruijn graph of the reads. Then they remove all tips and merge all unambiguous paths into single nodes that are annotated by the sequence of merged K-mers.

The resulting unresolved assembly graph (no longer de Bruijn) is a directed graph that consists only of bubbles and is a minimal representation of the variants that can be inferred from the sequenced data. Concatenating the sequences across the nodes in a particular path through this graph gives a possible assembly sequence.

Jun 23 2014

Python

Python in a Nutshell (Network Tips)

The `urlparse` Module

1
2
3

urljoin(base_url_string,relative_url_string)
urlsplit(url_string,default_scheme='',allow_fragments=True)
urlunsplit(url_tuple)

The `urllib` Module

quote(str,safe='/')
quote_plus(str, safe='/')
unquote(str)
unquote_plus(str)
urlcleanup( )
urlencode(query,doseq=False)
urlopen(urlstring,data=None,proxies=None)
urlretrieve(urlstring,filename=None,reporthook=None,data=None)

Jun 23 2014

Python

Python in a Nutshell (Thread Tips)

A thread is a flow of control that shares global state with other threads; all threads appear to execute simultaneously.

A process is an instance of a running program.

创建线程的两种方法

继承Thread类, 重写run()方法, 而不是start()
创建threading.Thread对象,初始化函数__init__()中可将调用对象作为参数传入

The `thread` Module

1
2
3

L.acquire(wait=True) # When wait is True, acquire locks L.
L.locked( )          # Returns True if L is locked; otherwise, False.
L.release( )         # Unlocks L, which must be locked.

The `Queue` Module

The Queue module supplies first-in, first-out (FIFO) queues that support multithread access, with one main class and two exception classes.

class Queue(maxsize=0)
q.empty()
q.get(False)
q.full()
q.put(x,False)
q.get(block=True,timeout=None)
q.put(item,block=True,timeout=None)
q.qsize( )

Jun 23 2014

Python

Python in a Nutshell (Testing Tips)

Testing

Unit Testing and System Testing

Unit testing means writing and running tests to exercise a single module or an even smaller unit, such as a class or function.

System testing (also known as functional or integration testing) involves running an entire program with known inputs.

Some classic books on testing also draw the distinction between white-box testing, done with knowledge of a program’s internals, and black-box testing, done without such knowledge.

The `doctest` Module

is_palindrome.py

def is_palindrome(s):
	''' (str) -> bool
	Return Ture if and only if is a palindrome
	>>> is_palindrome('noon')
	True
	>>> is_palindrome('people')
	False
	>>> is_palindrome('catac')
	True
	'''
	
	i = 0
	j = len(s) - 1
	while (i<j) and (s[i] == s[j]):
		i = i + 1
		j = j - 1
	return j <= i

if __name__ == '__main__': 
    import doctest
    doctest.testmod( )

Jun 23 2014

Python

Python in a Nutshell (IO Tips)

File and Text Operations

1	open(filename, mode='r', bufsize=-1)

File mode

Mode	Description
‘r’	The file must already exist, and it is opened in read-only mode.
‘w’	The file is opened in write-only mode. Overwritten or created.
‘a’	The file is opened in write-only mode. Appended or created.
‘r+’	The file must already exist and is opened for both reading and writing.
‘w+’	The file is opened for both reading and writing. Overwritten or created.
‘a+’	The file is opened for both reading and writing.Appended or created.
b	Binary.
t	Text.

Jun 23 2014

Python

Python in a Nutshell (Useful Tips)

List

lambda Expressions

>>> aList = [1, 2, 3, 4, 5, 6, 7, 8, 9]
>>> low = 3
>>> high = 7
>>> filter(lambda x, l=low, h=high: h>x>l, aList)
[4, 5, 6]
# or
>>> def within_bounds(value, l=low, h=high):return h>value>l
>>> filter(within_bounds, aList)

Jun 22 2014

Bioinformatics

SSPACE-LongRead: scaffolding with long reads

Introduction

We are happy to say that SSPACE is ready for dealing with the PacBio long reads for scaffolding[^1].

They proposed a novel hybrid assembly methodology that aims to scaffold pre-assembled contigs in an iterative manner using PacBio RS long read information as a backbone.

The SSPACE-LongRead software which is designed to upgrade incomplete draft genomes using single molecule sequences. We conclude that the recent advances of the PacBio sequencing technology and chemistry, in combination with the limited computational resources required to run our program, allow to scaffold genomes in a fast and reliable manner.

Jun 20 2014

Data Mining

Data Analysis with Python (Useful Tips)

复制与引用

# a = b 赋值时，创建对象的新引用
# 不可变对象(数字和字符串)，创建副本
# 可变对象(list 和 dict),创建引用，行为会有变化，危险
# 浅复制
a = [1,2,3,4]
b = list(a) # 共有元素部分会发生关联，危险

# 深复制
import copy
b = copy.deepcopy(a)

line        = "GOOD,100,490.10"
types       = [str,int,float]
raw_fields  = line.split(',')
fields      = [ty(vl) for ty,vl in zip(types,raw_field)]

collections

from collections import defaultdict
counts = defaultdict(int) # values will initialize to 0

from collections import Counter
counts = Counter(list)   # list 频数统计

pandas

from pandas import DataFrame, Series
import pandas as pd
frame = DataFrame(records)
results = Series([x.split()[0] for x in frame.a.dropna()])
results.value_counts()[:8]

import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip'] 
users = pd.read_table('ml-1m/users.dat', sep='::', header=None,
names=unames)

data = pd.merge(pd.merge(ratings, users), movies)

names1880 = pd.read_csv('names/yob1880.txt', names=['name', 'sex', 'births'])
names1880.groupby('sex').births.sum()

numpy

import numpy as np
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)

data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2.ndim
arr2.shape
arr2.dtype


np.zeros(10)
np.zeros((2,3))
np.arange(15)

### Linear Algebra
x = np.array([[1., 2., 3.], [4., 5., 6.]])
y = np.array([[6., 23.], [-1, 7], [8, 9]])
x.dot(y) # equivalently np.dot(x, y)

from numpy.linalg import inv, qr
X = randn(5, 5)
mat = X.T.dot(X)

inv(mat)
q, r = qr(mat)

pandas

import pandas as pd
#### Series
pd.Series([4, 7, -5, 3])
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])
obj2 = pd.Series(adict)

#### DataFrame
data = {'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'], 'year': [2000, 2001, 2002, 2001, 2002],
'pop': [1.5, 1.7, 3.6, 2.4, 2.9]}
frame = pd.DataFrame(data)

frame2 = DataFrame(data, columns=['year', 'state', 'pop', 'debt'],
index=['one', 'two', 'three', 'four', 'five'])

Jun 17 2014

Tools

Mysql Notes

基本点

database
table
column
datatype
row
primary key(unique and non-null)
(不更新主键列中的值；不重用主键列的值；不在主键列中使用可能会更改的值；)

SQL(Structured Query Language)

MySQL Administrator
MySQL Query Browser

初见

mysql -h 110.110.110.110 -uroot -p abcd123

mysqladmin -uroot -password ab12

grant select,insert,update,
delete on *.* to [email=test2@localhost][color=#355e9e]test2@localhost[/color][/email] identified by \"abc\";

create/drop database database-name;
show databases;
use database-name;
show tables;

show columns from table-name; <=> describe table-name;

show grants;    show status;    show errors;

1 Data Types and Basic Operations

Vector and Matrix

list

Factor

Missing Values

Data frames are used to store tabular data

Names

2 Deal with data

Data are read in using connection interfaces.

3 Control structures

4 Looping on the Command Line

applay

tapply

split

maaply

reshape

plyr

str

5 Generating random Numbers

Introduction

The urlparse Module

The urllib Module

创建线程的两种方法

The thread Module

The Queue Module

Testing

Unit Testing and System Testing

The doctest Module

File and Text Operations

File mode

List

lambda Expressions

Introduction

复制与引用

collections

pandas

numpy

pandas

基本点

初见

The `urlparse` Module

The `urllib` Module

The `thread` Module

The `Queue` Module

The `doctest` Module