R Programming Tips

1 Data Types and Basic Operations

R has five basic or “atomic” classes of objects:

  • character
  • numeric (real numbers)
  • integer
  • complex
  • logical (True/False)

Vector and Matrix

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
vector()
x <- vector("numeric", length = 10)

c()
x <- c("a", "b", "c")
x <- c(TRUE, FALSE)
as.numeric(x)
as.logical(x)
as.character(x)

attributes() 
print()

m <- matrix(nrow = 2, ncol = 3)
m <- matrix(1:6, nrow = 2, ncol = 3)

m<-1:10
dim(m) <- c(2, 5)

cbind(x, y)
rbind(x, y)

list

Lists are a special type of vector that can contain elements of different classes.

1
x <- list(1, "a", TRUE, 1 + 4i)

Factor

Factors are used to represent categorical data. Factors can be unordered or ordered. One can think of a factor as an integer vector where each integer has a label.

1
2
3
4
5
6
7
lm()
glm()

x <- factor(c("yes", "yes", "no", "yes", "no"))
table(x)

x <- factor(c("yes", "yes", "no", "yes", "no"), levels = c("yes", "no"))

Missing Values

Missing values are denoted by NA or NaN for undefined mathematical operations.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
x <- c(1,2,NaN,NA,4)
is.na(x)
is.nan(x)

x<-c(1,2,NA,4,NA,5)
bad <- is.na(x)
x[!bad]

#### What if there are multiple things and you want to take the subset with no missing values?
x<-c(1,2,NA,4,NA,5)
y <- c("a", "b", NA, "d", NA, "f")
good <- complete.cases(x, y)
x[good]
y[good]

df[1:6,]
good <- complete.cases(df)
df[good,][1:6,]

x <- df[df$Month==5,]
summary(x$Ozone)

Data frames are used to store tabular data

1
2
3
4
5
6
row.names
read.table()
read.csv()
data.matrix()

x <- data.frame(foo = 1:4, bar = c(T, T, F, F))

Names

1
2
3
4
5
6
7
8
9
x<-1:3
names(x)
names(x) <- c("foo", "bar", "norf")
x

x<-list(a=1,b=2,c=3)

m <- matrix(1:4, nrow = 2, ncol = 2)
dimnames(m) <- list(c("a", "b"), c("c", "d"))

2 Deal with data

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
read.table()    #for reading tabular data
read.csv()
write.table()

readLines()     #for reading lines of a text file
writeLines()

source()        #for reading in R code files
dump()

dump(c("x", "y"), file = "data.R")
rm(x, y)
source("data.R")

dget()          #for reading in R code files 
dput()

load()          #for reading in saved workspaces
save()

unserialize()   #for reading single R objects in binary form
serialize()

Data are read in using connection interfaces.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
file     #opens a connection to a file
gzfile   #opens a connection to a file compressed with gzip
bzfile   #opens a connection to a file compressed with bzip2
url      #opens a connection to a webpage

str(file)
con <- file("foo.txt", "r")
data <- read.csv(con)
close(con)

con <- gzfile("words.gz")
x <- readLines(con, 10)

con <- url("http://www.jhsph.edu", "r")
x <- readLines(con)

args(paste)
function (..., sep = " ", collapse = NULL)

I have a data frame with 1,500,000 rows and 120 columns, all of which are numeric data. Roughly, how much memory is required to store this data frame?

1
1,500,000 × 120 × 8 bytes/numeric = 1.34 GB

3 Control structures

  • if, else: testing a condition
  • for: execute a loop a fixed number of times
  • while: execute a loop while a condition is true · repeat: execute an infinite loop
  • break: break the execution of a loop
  • next: skip an interation of a loop
  • return: exit a function

4 Looping on the Command Line

  • lapply: Loop over a list and evaluate a function on each element ·
  • sapply: Same as lapply but try to simplify the result
  • apply: Apply a function over the margins of an array
  • tapply: Apply a function over subsets of a vector
  • mapply: Multivariate version of lapply
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
x <- list(a = 1:5, b = rnorm(10))
lapply(x, mean)  #

x<-1:4
lapply(x, runif, min = 0, max = 10)

##### lapply and friends make heavy use of anonymous functions.
x <- list(a = matrix(1:4, 2, 2), b = matrix(1:6, 3, 2))
lapply(x, function(elt) elt[,1])

##### sapply will try to simplify the result of lapply if possible.
##### If the result is a list where every element is length 1, then a vector is returned
##### If the result is a list where every element is a vector of the same length (> 1), a matrix is returned.
##### If it can’t figure things out, a list is returned.
x <- list(a = 1:4, b = rnorm(10), c = rnorm(20, 1), d = rnorm(100, 5))
lapply(x, mean)
sapply(x, mean)

applay

apply is used to a evaluate a function (often an anonymous one) over the margins of an array.

  • It is most often used to apply a function to the rows or columns of a matrix.
  • It can be used with general arrays, e.g. taking the average of an array of matrices
  • It is not really faster than writing a loop, but it works in one line!
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
str(apply)
x <- matrix(rnorm(200), 20, 10)
apply(x, 2, mean)
apply(x, 1, sum)

rowSums = apply(x, 1, sum)
rowMeans = apply(x, 1, mean)
colSums = apply(x, 2, sum)
colMeans = apply(x, 2, mean)

x <- matrix(rnorm(200), 20, 10)
apply(x, 1, quantile, probs = c(0.25, 0.75))

a <- array(rnorm(2 * 2 * 10), c(2, 2, 10))
apply(a, c(1, 2), mean)
rowMeans(a, dims = 2)

tapply

tpply is used to apply a function over subsets of a vector.

1
2
3
4
5
6
7
8
9
str(tapply)

x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
tapply(x, f, mean)


with(mtcars, tapply(hp, cyl, mean))
sapply(split(mtcars$hp, mtcars$cyl), mean)

split

split takes a vector or other objects and splits it into groups determined by a factor or list of factors.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
str(split)
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
split(x, f)
lapply(split(x, f), mean)

##### splitting a Data Frame
library(datasets)
head(airquality)
s <- split(airquality, airquality$Month)
lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))

##### split more than one level
x <- rnorm(10)
f1<-gl(2,5)
f2<-gl(5,2)
interaction(f1, f2)
str(split(x, list(f1, f2)))
str(split(x, list(f1, f2), drop = TRUE))

maaply

mapply is a multivariate apply of sorts which applies a function in parallel over a set of arguments.

1
2
3
4
str(mapply)

list(rep(1, 4), rep(2, 3), rep(3, 2), rep(4, 1))
mapply(rep, 1:4, 4:1)

reshape

1
2
3
4
5
6
7
library("reshape2")

##### metlting data frames
melt(mtcars, id=c("carname","gear","cyl"), measure.vars=c("mpg","hp"))

##### casting data frames
dcast(carMelt, cyl ~ variable, mean)

plyr

1
ddply()

str

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
str(str)
str(lm)
str(ls)

args(paste)
function (..., sep = " ", collapse = NULL)

x <- rnrom(100,2,4)
summary(x)
str(x)

f <- gl(40,10)
str(f)
summary(f)

### split()
### split takes a vector or other objects and splits it into groups determined by a factor or list of factors.
str(split)
x <- c(rnorm(10), runif(10), rnorm(10, 1))
f<-gl(3,10)
split(x, f)

### A common idiom is split followed by an lapply
lapply(split(x, f), mean)

5 Generating random Numbers

  • d for density ->pdf
  • r for random number generation
  • p for cumulative distribution -> cdf
  • q for quantile function
1
2
3
4
5
6
7
8
9
10
11
12
dnorm(x, mean = 0, sd = 1, log = FALSE)
##### pnorm(q) = fi(q); qnorm(p) = fi(q)反函数
pnorm(q, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
qnorm(p, mean = 0, sd = 1, lower.tail = TRUE, log.p = FALSE)
rnorm(n, mean = 0, sd = 1)

x <- rnorm(10,20,10)
summary(x)

### Setting the random number seed with set.seed ensures reproducibility
### Always set the random number seed when conducting a simulation!
set.seed(1)