R allows loops but they are generally avoided since they can be speeded up with the use of vectorial functions, that is functions that are applied on vectors of data. These are functions of the apply family such as apply, lapply, sapply and tapply.
apply() The generic apply() function is used to apply a function on a vector. It thus returns a vector, after applying a function either a column or a row of a matrix. The choice of the which is performed through the setting of a dimension parameter.
x<-cbind(seq(from=1, to=10,by=1), seq(from=2, to=20, by=2), seq(from=3, to=30, by=3))
sumbyrow=apply(x, 1, sum)
sumbyrow
## [1] 6 12 18 24 30 36 42 48 54 60
sumbycol=apply(x, 2, sum)
sumbycol
## [1] 55 110 165
in the example above we apply the sum() function on the rows (1) and the columns of the matrix x. There is already a set of functions that do this (rowSums, colSums), but the apply function does not stop there. You can use any function you want such as sort:
x<-cbind(runif(10, min=1, max=10), runif(10, min=0, max=1), runif(10, min=-5, max=5))
x
## [,1] [,2] [,3]
## [1,] 6.233936 0.92517910 -3.53065823
## [2,] 4.588225 0.05061863 -2.32380091
## [3,] 8.470859 0.42895311 4.15892935
## [4,] 6.001513 0.32978241 -4.17910700
## [5,] 6.513536 0.39151773 -1.27267565
## [6,] 8.189752 0.64284462 3.81959153
## [7,] 4.399496 0.88432248 -4.13589657
## [8,] 4.369471 0.92203103 -0.02080125
## [9,] 7.961910 0.04349732 2.97137155
## [10,] 1.206806 0.95901197 -4.65653554
sortbycol=apply(x, 2, sort)
sortbycol
## [,1] [,2] [,3]
## [1,] 1.206806 0.04349732 -4.65653554
## [2,] 4.369471 0.05061863 -4.17910700
## [3,] 4.399496 0.32978241 -4.13589657
## [4,] 4.588225 0.39151773 -3.53065823
## [5,] 6.001513 0.42895311 -2.32380091
## [6,] 6.233936 0.64284462 -1.27267565
## [7,] 6.513536 0.88432248 -0.02080125
## [8,] 7.961910 0.92203103 2.97137155
## [9,] 8.189752 0.92517910 3.81959153
## [10,] 8.470859 0.95901197 4.15892935
or other functions that check value existence such as is.na(). apply() can also implement your own functions which you can write like this:
zscore<-function(x){z<-(x-mean(x))/sd(x); return(z)}
x<-cbind(runif(10, min=1, max=10), runif(10, min=0, max=1), runif(10, min=-5, max=5))
apply(x, 2, zscore)
## [,1] [,2] [,3]
## [1,] -0.66031129 -1.3607870 0.04029947
## [2,] -0.30519070 -1.0520945 0.56351343
## [3,] 1.84255034 -0.8817110 -0.25350897
## [4,] -0.03739408 -0.5676326 -1.37671324
## [5,] -1.06890576 0.6562294 -1.28089348
## [6,] -0.93816145 0.6447516 1.62622704
## [7,] 0.05305558 -0.6758497 1.02163249
## [8,] -0.77490501 1.0074008 0.60415842
## [9,] 1.52073982 0.8521693 0.06659337
## [10,] 0.36852255 1.3775237 -1.01130853
tapply() The tapply function is useful when we need to break up a vector into groups defined by some classifying factor, compute a function on the subsets, and return the results in a convenient form. You can even specify multiple factors as the grouping variable, for example treatment and sex, or team and handedness. For example we can use tapply on the iris dataset to extract the means of for Sepal length for each species
tapply(iris$Sepal.Length, iris$Species, mean)
## setosa versicolor virginica
## 5.006 5.936 6.588
while this can also be done for vector functions such as the one we saw before (zscore)
tapply(iris$Sepal.Length, iris$Species, zscore)
## $setosa
## [1] 0.26667447 -0.30071802 -0.86811050 -1.15180675 -0.01702177
## [6] 1.11776320 -1.15180675 -0.01702177 -1.71919923 -0.30071802
## [11] 1.11776320 -0.58441426 -0.58441426 -2.00289548 2.25254817
## [16] 1.96885193 1.11776320 0.26667447 1.96885193 0.26667447
## [21] 1.11776320 0.26667447 -1.15180675 0.26667447 -0.58441426
## [26] -0.01702177 -0.01702177 0.55037071 0.55037071 -0.86811050
## [31] -0.58441426 1.11776320 0.55037071 1.40145944 -0.30071802
## [36] -0.01702177 1.40145944 -0.30071802 -1.71919923 0.26667447
## [41] -0.01702177 -1.43550299 -1.71919923 -0.01702177 0.26667447
## [46] -0.58441426 0.26667447 -1.15180675 0.83406695 -0.01702177
##
## $versicolor
## [1] 2.06133180 0.89892665 1.86759761 -0.84468108 1.09266084
## [6] -0.45721269 0.70519246 -2.00708623 1.28639503 -1.42588365
## [11] -1.81335204 -0.06974431 0.12398988 0.31772407 -0.65094688
## [16] 1.48012923 -0.65094688 -0.26347850 0.51145827 -0.65094688
## [21] -0.06974431 0.31772407 0.70519246 0.31772407 0.89892665
## [26] 1.28639503 1.67386342 1.48012923 0.12398988 -0.45721269
## [31] -0.84468108 -0.84468108 -0.26347850 0.12398988 -1.03841527
## [36] 0.12398988 1.48012923 0.70519246 -0.65094688 -0.84468108
## [41] -0.84468108 0.31772407 -0.26347850 -1.81335204 -0.65094688
## [46] -0.45721269 -0.45721269 0.51145827 -1.61961784 -0.45721269
##
## $virginica
## [1] -0.4529159 -1.2392283 0.8051839 -0.4529159 -0.1383910 1.5914963
## [7] -2.6545906 1.1197088 0.1761340 0.9624464 -0.1383910 -0.2956535
## [13] 0.3333965 -1.3964908 -1.2392283 -0.2956535 -0.1383910 1.7487587
## [19] 1.7487587 -0.9247034 0.4906589 -1.5537533 1.7487587 -0.4529159
## [25] 0.1761340 0.9624464 -0.6101784 -0.7674409 -0.2956535 0.9624464
## [31] 1.2769713 2.0632837 -0.2956535 -0.4529159 -0.7674409 1.7487587
## [37] -0.4529159 -0.2956535 -0.9247034 0.4906589 0.1761340 0.4906589
## [43] -1.2392283 0.3333965 0.1761340 0.1761340 -0.4529159 -0.1383910
## [49] -0.6101784 -1.0819658
lapply() and sapply() lapply traverses vectors or lists, applies a function and produces output in the form of a list.
x<-cbind(runif(10, min=1, max=10), runif(10, min=0, max=1), runif(10, min=-5, max=5))
lapply(x, function(x) x^2)
## [[1]]
## [1] 4.818479
##
## [[2]]
## [1] 70.68591
##
## [[3]]
## [1] 5.035471
##
## [[4]]
## [1] 38.69656
##
## [[5]]
## [1] 20.12575
##
## [[6]]
## [1] 9.577096
##
## [[7]]
## [1] 77.16921
##
## [[8]]
## [1] 75.38829
##
## [[9]]
## [1] 42.4735
##
## [[10]]
## [1] 35.53796
##
## [[11]]
## [1] 0.4844295
##
## [[12]]
## [1] 0.01983359
##
## [[13]]
## [1] 0.5607224
##
## [[14]]
## [1] 0.9464368
##
## [[15]]
## [1] 0.02081824
##
## [[16]]
## [1] 0.1278128
##
## [[17]]
## [1] 0.9977127
##
## [[18]]
## [1] 0.000344808
##
## [[19]]
## [1] 0.003600031
##
## [[20]]
## [1] 0.001674236
##
## [[21]]
## [1] 6.411431
##
## [[22]]
## [1] 12.77393
##
## [[23]]
## [1] 12.2411
##
## [[24]]
## [1] 1.83931
##
## [[25]]
## [1] 0.5421461
##
## [[26]]
## [1] 8.009984
##
## [[27]]
## [1] 16.26753
##
## [[28]]
## [1] 7.719684
##
## [[29]]
## [1] 13.3521
##
## [[30]]
## [1] 5.920523
while sapply produces the same output in a simplified format of a vector
x<-cbind(runif(10, min=1, max=10), runif(10, min=0, max=1), runif(10, min=-5, max=5))
sapply(x, function(x) x^2)
## [1] 2.402993e+01 9.078173e+01 2.488532e+00 8.920271e+01 4.558367e+00
## [6] 1.886573e+01 6.583063e+01 1.165476e+01 2.064273e+01 6.441983e+01
## [11] 3.031566e-01 3.302035e-01 9.610846e-01 8.316291e-01 8.942871e-02
## [16] 8.719708e-04 6.549580e-01 6.677254e-03 4.525669e-03 2.955900e-02
## [21] 2.599822e-02 2.433593e+01 7.474564e-01 5.958990e-02 1.665336e+01
## [26] 1.158597e+00 1.510537e+01 1.045811e+01 1.343483e+00 4.748700e-01
One of R’s main powerful is the implementation of statistical tests. Most of the functions are of the form “function”.test(). Below we briefly discuss only some of them:
t.test() checks the difference of means between two samples (x,y). It can be paired or unpaired but always assumes that x and y follow the normal distribution.
x<-rnorm(100, mean=1, sd=1)
y<-rnorm(100, mean=0.9, sd=1)
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 0.85955, df = 187.51, p-value = 0.3911
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## -0.1663465 0.4232469
## sample estimates:
## mean of x mean of y
## 1.137058 1.008608
the output contains the two means, the p-value and df but even more importantly the confidence intervals. Compare the dependence of the p-value on the sample size here:
x<-rnorm(1000, mean=1, sd=1)
y<-rnorm(1000, mean=0.9, sd=1)
t.test(x,y)
##
## Welch Two Sample t-test
##
## data: x and y
## t = 2.0096, df = 1997.6, p-value = 0.04461
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
## 0.002149114 0.176157078
## sample estimates:
## mean of x mean of y
## 0.9659543 0.8768012
We can test the correlation between two paired values with the functions cor() and cor.test()
data<-iris[,1:2]
cor(data[,1], data[,2])
## [1] -0.1175698
but cor() may be applied on matrices as well, calculating the correlation of all against all:
data<-iris[,1:4]
cor(data)
## Sepal.Length Sepal.Width Petal.Length Petal.Width
## Sepal.Length 1.0000000 -0.1175698 0.8717538 0.8179411
## Sepal.Width -0.1175698 1.0000000 -0.4284401 -0.3661259
## Petal.Length 0.8717538 -0.4284401 1.0000000 0.9628654
## Petal.Width 0.8179411 -0.3661259 0.9628654 1.0000000
cor.test() will test the significance of the correlation
data<-iris[,1:2]
cor.test(data[,1], data[,2])
##
## Pearson's product-moment correlation
##
## data: data[, 1] and data[, 2]
## t = -1.4403, df = 148, p-value = 0.1519
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.27269325 0.04351158
## sample estimates:
## cor
## -0.1175698
R supports if, if/else conditional structures, which can be introduced with the usual notation of paretheses and curly brackets like this:
x <- 100
if(x > 0) {
print("x is positive")
}
## [1] "x is positive"
if/else structures are no different. Check this silly example:
x <- c("Sakis","Makis","Takis")
if("Lakis" %in% x) {
print("Lakis is here")
} else if ("lakis" %in% x) {
print("lakis exists instead")
} else {
print("L/lakis is not here")
}
## [1] "L/lakis is not here"
A switch command also exists. We very often use it to branch commands based on some statement, thus introducing an “option”. See this example:
x<-3
switch(x, case1=1, case2=3, case3=2.5, 99)
## [1] 2.5
x<-""
switch(x, case1=1, case2=3, case3=2.5, 99)
## [1] 99
switch() takes the first argument and “maps” it to the possible choices according to its indexing (this is 1-based as usually in R). Notice how the last (un-named option) is assigned to the empty argument.
Are supports looping. Even though it is not highly recommended to use loops as they are notoriously slower than other R functions that can do the same thing through vectors. Nonetheless, some times it is hard to avoid them and for small datasets they work just fine. There are three types of loop structures in R. for{}, while{} and repeat{}. for{} works like this:
x<-runif(10)
for (i in x){
print(i)
}
## [1] 0.6430267
## [1] 0.5145846
## [1] 0.9380069
## [1] 0.773307
## [1] 0.3018448
## [1] 0.7608309
## [1] 0.7438895
## [1] 0.6617608
## [1] 0.5006346
## [1] 0.5341947
of course there are variant to how you use for, which means you can do things in a more controlled manner. Say you only want to loop over a number of elements of x and calculate their squares in another vector:
x<-runif(100)
sqx<-vector(mode="numeric", length=10)
for (i in 20:29){
sqx[i-19]=x[i]**2
}
cbind(x[20:29],sqx)
## sqx
## [1,] 0.226462885 5.128544e-02
## [2,] 0.040234208 1.618791e-03
## [3,] 0.038537353 1.485128e-03
## [4,] 0.096588138 9.329268e-03
## [5,] 0.562012086 3.158576e-01
## [6,] 0.580068226 3.364791e-01
## [7,] 0.957621434 9.170388e-01
## [8,] 0.833922587 6.954269e-01
## [9,] 0.006189494 3.830984e-05
## [10,] 0.989307176 9.787287e-01
while{} loop are used with a built-in condition
x<-runif(100)
sqx<-vector(mode="numeric", length=10)
i<-20
while (i < 30) {
sqx[i-19]=x[i]**2
i <- i + 1
}
cbind(x[20:29],sqx)
## sqx
## [1,] 0.98593838 0.9720744818
## [2,] 0.60326951 0.3639340989
## [3,] 0.03034164 0.0009206152
## [4,] 0.23690031 0.0561217584
## [5,] 0.12466660 0.0155417617
## [6,] 0.82859775 0.6865742271
## [7,] 0.71062064 0.5049816888
## [8,] 0.54834563 0.3006829274
## [9,] 0.30218191 0.0913139046
## [10,] 0.81888031 0.6705649557
repeat{} is similar to the while loop but requires a loop-exit statement such as break. Here’s how the example above would work:
x<-runif(100)
sqx<-vector(mode="numeric", length=10)
i<-20
repeat {
sqx[i-19]=x[i]**2
i <- i + 1
if (i==30){
break
}
}
cbind(x[20:29],sqx)
## sqx
## [1,] 0.15294414 0.023391910
## [2,] 0.93003107 0.864957799
## [3,] 0.64265924 0.413010899
## [4,] 0.91404158 0.835472001
## [5,] 0.83686284 0.700339407
## [6,] 0.89965738 0.809383400
## [7,] 0.10072630 0.010145788
## [8,] 0.07070094 0.004998624
## [9,] 0.19030764 0.036216996
## [10,] 0.61780118 0.381678295
Notice how repeat needs to introduce an additional if() statement to break the loop and exit. Apart from break there is also a next statement that sends the loop to the following iteration without completing the rest of the statements.
v <- LETTERS[1:6] # first six upper-case letters
for ( i in v) {
if (i == "D") {
next
}
print(i)
}
## [1] "A"
## [1] "B"
## [1] "C"
## [1] "E"
## [1] "F"
You can set up your own functions like this
zscore<-function(x){
z<-((x-mean(x))/sd(x));
return(z)
}
zscore() is now a function that takes one vector as argument (nobody tells you that, but you wrote it so you should know). You can call it like this:
x<-runif(100)
z<-zscore(x)
head(cbind(x,z))
## x z
## [1,] 0.8991517 1.26996925
## [2,] 0.9860789 1.56064382
## [3,] 0.4696269 -0.16631318
## [4,] 0.7470079 0.76121728
## [5,] 0.5393775 0.06692476
## [6,] 0.6549425 0.45336129
Install packages with:
install.packages("beanplot")
and then load them with
library("beanplot")
The second command needs to be run every time you restart R while the first is only required once, in order to fetch and install the library/package to your local R depository.