vector()
x <- vector("numeric", length = 10)
c()
x <- c("a", "b", "c")
x <- c(TRUE, FALSE)
as.numeric(x)
as.logical(x)
as.character(x)
attributes()
print()
m <- matrix(nrow = 2, ncol = 3)
m <- matrix(1:6, nrow = 2, ncol = 3)
m <- 1:10
dim(m) <- c(2, 5)
cbind(x, y)
rbind(x, y)
list
Lists are a special type of vector that can contain elements of different classes.
x <- list(1, "a", TRUE, 1 + 4i)
Factor
Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.
Factors are treated specially by modelling functions such as lm() and glm().
x <- factor(c("yes", "yes", "no", "yes", "no"))
table(x)
x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))
Missing Values
Missing values are denoted by NA; NaN denotes the result of an undefined mathematical operation.
x <- c(1, 2, NaN, NA, 4)
is.na(x)
is.nan(x)
x <- c(1, 2, NA, 4, NA, 5)
bad <- is.na(x)
x[!bad]
#### What if there are multiple things and you want to take the subset with no missing values?
x <- c(1, 2, NA, 4, NA, 5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)
x[good]
y[good]
df[1:6,]
good <- complete.cases(df)
df[good,][1:6,]
x <- df[df$Month == 5, ]
summary(x$Ozone)
read.table() #for reading tabular data
read.csv()
write.table()
readLines() #for reading lines of a text file
writeLines()
source() #for reading in R code files
dump()
dump(c("x", "y"), file = "data.R")
rm(x, y)
source("data.R")
dget() #for reading in R code files
dput()
load() #for reading in saved workspaces
save()
unserialize() #for reading single R objects in binary form
serialize()
Data are read in using connection interfaces.
file #opens a connection to a file
gzfile #opens a connection to a file compressed with gzip
bzfile #opens a connection to a file compressed with bzip2
url #opens a connection to a webpage
str(file)
con <- file("foo.txt", "r")
data <- read.csv(con)
close(con)
con <- gzfile("words.gz")
x <- readLines(con, 10)
con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)
args(paste)
function (..., sep = " ", collapse = NULL)
I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?
1,500,000 × 120 × 8 bytes/numeric = 1,440,000,000 bytes ≈ 1.34 GB (dividing by 2^30 bytes per GB)
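The same back-of-the-envelope estimate can be checked with a few lines of arithmetic. This is a minimal sketch written in Python rather than R; the variable names are illustrative, and it assumes every value is stored as an 8-byte double and ignores any per-object overhead.

```python
# Rough memory estimate for an all-numeric data frame.
# Assumption: each numeric value is an 8-byte double; overhead is ignored.
rows = 1_500_000
cols = 120
bytes_per_numeric = 8

total_bytes = rows * cols * bytes_per_numeric
total_gb = total_bytes / 2**30  # 2^30 bytes per GB, as in the estimate above

print(total_bytes)         # 1440000000
print(round(total_gb, 2))  # 1.34
```

In practice more memory than this is usually needed while the data is being read in, so the figure is best treated as a lower bound.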
3 Control structures
if, else: testing a condition
for: execute a loop a fixed number of times
while: execute a loop while a condition is true
repeat: execute an infinite loop
break: break the execution of a loop
next: skip an iteration of a loop
return: exit a function
4 Looping on the Command Line
lapply: Loop over a list and evaluate a function on each element
sapply: Same as lapply but try to simplify the result
apply: Apply a function over the margins of an array
tapply: Apply a function over subsets of a vector
mapply: Multivariate version of lapply
x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)
x <- 1:4
lapply(x, runif, min = 0, max = 10)
##### lapply and friends make heavy use of anonymous functions.
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
lapply(x, function(elt) elt[,1])
##### sapply will try to simplify the result of lapply if possible.
##### If the result is a list where every element is length 1, then a vector is returned
##### If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
##### If it can’t figure things out, a list is returned.
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
sapply(x, mean)
apply
apply is used to evaluate a function (often an anonymous one) over the margins of an array.
It is most often used to apply a function to the rows or columns of a matrix.
It can be used with general arrays, e.g. taking the average of an array of matrices.
It is not really faster than writing a loop, but it works in one line!
dnorm(x, mean = 0, sd = 1, log = FALSE)
##### pnorm(q) = Φ(q); qnorm(p) = Φ⁻¹(p), the inverse function of Φ
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)
x <- rnorm(10,20,10)
summary(x)
### Setting the random number seed with set.seed ensures reproducibility.
### Always set the random number seed when conducting a simulation!
set.seed(1)
They first build an assembly graph starting from a de Bruijn graph of the reads. Then they remove all tips and merge all unambiguous paths into single nodes that are annotated by the sequence of merged K-mers.
The resulting unresolved assembly graph (no longer de Bruijn) is a directed graph that consists only of bubbles and is a minimal representation of the variants that can be inferred from the sequenced data. Concatenating the sequences across the nodes in a particular path through this graph gives a possible assembly sequence.
A thread is a flow of control that shares global state with other threads; all threads appear to execute simultaneously.
A process is an instance of a running program.
Two ways to create a thread
Subclass Thread and override the run() method, not start()
Create a threading.Thread object, passing the callable as an argument to its __init__() initializer
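The two approaches can be sketched as follows. This is a minimal illustration for Python 3; the names `Worker`, `work`, and `results` are made up for the example and do not come from the notes.

```python
import threading

results = []  # list.append is atomic in CPython, so no lock is needed here


class Worker(threading.Thread):
    """Approach 1: subclass Thread and override run(), not start()."""

    def __init__(self, label):
        super().__init__()
        self.label = label

    def run(self):
        results.append("subclass:" + self.label)


def work(label):
    """Approach 2: a plain callable passed to Thread via target=."""
    results.append("target:" + label)


t1 = Worker("a")
t2 = threading.Thread(target=work, args=("b",))

for t in (t1, t2):
    t.start()  # start() arranges for run() to execute in a new thread
for t in (t1, t2):
    t.join()   # wait for both threads to finish

print(sorted(results))  # ['subclass:a', 'target:b']
```

Calling `run()` directly would execute the body in the current thread, which is why `start()` is left alone and only `run()` is overridden.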
The thread Module
L.acquire(wait=True) # When wait is True, acquire locks L.
L.locked() # Returns True if L is locked; otherwise, False.
L.release() # Unlocks L, which must be locked.
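A short sketch of these three lock methods in use. It assumes Python 3 and uses `threading.Lock` in place of the older `thread` module's lock; there the first parameter of `acquire` is called `blocking` rather than `wait`. The counter and the `add_many` helper are illustrative only.

```python
import threading

counter = 0
lock = threading.Lock()


def add_many(n):
    """Increment the shared counter n times, guarding each update."""
    global counter
    for _ in range(n):
        lock.acquire()      # blocks until the lock is free, then locks it
        try:
            counter += 1    # only one thread at a time runs this line
        finally:
            lock.release()  # the lock must be held when release() is called


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counter)        # 40000
print(lock.locked())  # False
```

The `try`/`finally` pair can also be written as `with lock:`, which acquires and releases the lock automatically.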
The Queue Module
The Queue module supplies first-in, first-out (FIFO) queues that support multithread access, with one main class and two exception classes.
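A minimal producer/consumer sketch of the main class and the two exception classes. It assumes Python 3, where the module is named `queue` (it was `Queue` in Python 2); the `None` sentinel and the variable names are choices made for this example.

```python
import queue      # named Queue in Python 2
import threading

q = queue.Queue(maxsize=2)  # FIFO queue holding at most two items
consumed = []


def consumer():
    """Take items off the queue until the None sentinel arrives."""
    while True:
        item = q.get()      # blocks while the queue is empty
        if item is None:
            break
        consumed.append(item)


t = threading.Thread(target=consumer)
t.start()
for i in range(5):
    q.put(i)                # blocks while the queue is full
q.put(None)                 # tell the consumer to stop
t.join()

# The two exception classes are raised by the non-blocking calls.
try:
    q.get_nowait()
    empty_raised = False
except queue.Empty:
    empty_raised = True

small = queue.Queue(maxsize=1)
small.put_nowait("x")
try:
    small.put_nowait("y")
    full_raised = False
except queue.Full:
    full_raised = True

print(consumed)  # [0, 1, 2, 3, 4]
```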
Unit testing means writing and running tests to exercise a single module or an even smaller unit, such as a class or function.
System testing (also known as functional or integration testing) involves running an entire program with known inputs.
Some classic books on testing also draw the distinction between white-box testing, done with knowledge of a program’s internals, and black-box testing, done without such knowledge.
def is_palindrome(s):
    ''' (str) -> bool
    Return True if and only if s is a palindrome.
    >>> is_palindrome('noon')
    True
    >>> is_palindrome('people')
    False
    >>> is_palindrome('catac')
    True
    '''
    i = 0
    j = len(s) - 1
    # Walk inwards from both ends while the characters match.
    while (i < j) and (s[i] == s[j]):
        i = i + 1
        j = j - 1
    return j <= i

if __name__ == '__main__':
    import doctest
    doctest.testmod()
We are happy to say that SSPACE is ready for dealing with the PacBio long reads for scaffolding[^1].
They proposed a novel hybrid assembly methodology that aims to scaffold pre-assembled contigs in an iterative manner using PacBio RS long read information as a backbone.
The SSPACE-LongRead software is designed to upgrade incomplete draft genomes using single-molecule sequences. We conclude that the recent advances in PacBio sequencing technology and chemistry, in combination with the limited computational resources required to run our program, allow genomes to be scaffolded in a fast and reliable manner.
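# Assignment a = b creates a new reference to the same object; nothing is copied.
# For immutable objects (numbers and strings) this behaves like a copy.
# For mutable objects (list and dict) both names share one object, so a change made through one is visible through the other: dangerous.
# Shallow copy
a = [1,2,3,4]
b = list(a) # a new list, but its elements are shared with a; nested mutable elements stay linked: dangerous
# Deep copy
import copy
b = copy.deepcopy(a)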
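The difference only shows up when the list contains a mutable element. The sketch below uses a nested list for that reason; the names `alias`, `shallow`, and `deep` are illustrative.

```python
import copy

a = [1, 2, [3, 4]]
alias = a                # plain assignment: a second name for the same list
shallow = list(a)        # new outer list, but the inner list is shared
deep = copy.deepcopy(a)  # fully independent copy

a[0] = 99                # rebinding an element of the outer list
a[2].append(5)           # mutating the shared inner list

print(alias)    # [99, 2, [3, 4, 5]]
print(shallow)  # [1, 2, [3, 4, 5]]
print(deep)     # [1, 2, [3, 4]]
```

The shallow copy is unaffected by `a[0] = 99` but still sees the change to the inner list, while the deep copy sees neither.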
line = "GOOD,100,490.10"
types = [str,int,float]
raw_fields = line.split(',')
fields = [ty(vl) for ty, vl in zip(types, raw_fields)]
collections
from collections import defaultdict
counts = defaultdict(int) # values will initialize to 0
from collections import Counter
counts = Counter(items) # frequency count of the elements of items (any iterable)
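A small sketch comparing the two ways of counting. The `words` list is made-up sample data.

```python
from collections import Counter, defaultdict

words = ["yes", "no", "yes", "yes", "no", "maybe"]

# defaultdict(int): missing keys start at 0, so no membership check is needed.
counts = defaultdict(int)
for w in words:
    counts[w] += 1

# Counter does the same tally in one call and adds helpers such as most_common.
tally = Counter(words)

print(dict(counts))          # {'yes': 3, 'no': 2, 'maybe': 1}
print(tally.most_common(2))  # [('yes', 3), ('no', 2)]
```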
pandas
from pandas import DataFrame, Series
import pandas as pd
frame = DataFrame(records)
results = Series([x.split()[0] for x in frame.a.dropna()])
results.value_counts()[:8]
import pandas as pd
unames = ['user_id', 'gender', 'age', 'occupation', 'zip']
users = pd.read_table('ml-1m/users.dat', sep='::', header=None,
                      names=unames, engine='python')  # multi-character separators need the Python parser
data = pd.merge(pd.merge(ratings, users), movies)
names1880 = pd.read_csv('names/yob1880.txt', names=['name', 'sex', 'births'])
names1880.groupby('sex').births.sum()
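Key terms: database, table, column, datatype, row, primary key (unique and non-null). Rules for primary keys: do not update values in primary key columns; do not reuse values of primary key columns; do not use values that might change in primary key columns.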
SQL(Structured Query Language)
MySQL Administrator, MySQL Query Browser
First look
mysql -h 110.110.110.110 -uroot -p abcd123
mysqladmin -uroot password ab12
grant select,insert,update,delete on *.* to test2@localhost identified by "abc";
create/drop database database-name;
show databases;
use database-name;
show tables;
show columns from table-name; <=> describe table-name;
show grants; show status; show errors;