R is a language and environment for statistical computing and graphics. It is a GNU project which is similar to the S language and environment which was developed at Bell Laboratories (formerly AT&T, now Lucent Technologies) by John Chambers and colleagues. R can be considered as a different implementation of S. There are some important differences, but much code written for S runs unaltered under R.
R is an integrated suite of software facilities for data manipulation, calculation and graphical display. It includes:

* an effective data handling and storage facility,
* a suite of operators for calculations on arrays, in particular matrices,
* a large, coherent, integrated collection of intermediate tools for data analysis,
* graphical facilities for data analysis and display either on-screen or on hardcopy, and
* a well-developed, simple and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.
The term “environment” is intended to characterize it as a fully planned and coherent system, rather than an incremental accretion of very specific and inflexible tools, as is frequently the case with other data analysis software. Many users think of R as a statistics system. We prefer to think of it as an environment within which statistical techniques are implemented. R can be extended (easily) via packages. There are about eight packages supplied with the R distribution and many more are available through the CRAN family of Internet sites covering a very wide range of modern statistics.
Installing R is very easy. On Debian-based Linux systems (such as Ubuntu) just type:
sudo apt-get install r-base-core
in your terminal. Alternatively you can install R via the software repositories of your distribution.
On both Linux and Windows you can try RStudio, a graphical interface to R with split windows that hold the history, plots and files separately for advanced control (at a certain expense of speed and memory). More information is available at http://www.rstudio.com/ and it can be downloaded from http://www.rstudio.com/products/rstudio/download/
Starting R is even easier. In Linux you simply type:
R
in your terminal and the environment loads automatically. You can now start working.
RStudio starts either via the command line or by clicking the appropriate icon in your programs folder.
R is open source, which makes it very easy to access the necessary information. There is a great number of online manuals and free books for using and programming R. The R development team has an extended list here http://www.r-project.org/doc/bib/R-books.html
The standard documentation for each function pops up in R simply by typing
?nameoffunction
or, in case you do not know the exact name of the function or the package that provides it is not loaded, you can search the help pages of all installed packages by typing:
??nameoffunction
When everything else fails you can seek expert advice online in the R-help and R-devel mailing lists as well as in various fora (e.g. http://stackoverflow.com/). Plus there is always Google.
Entering input is quite straightforward. You can simply type in data in the R environment using the symbol “<-” as the assignment operator. For example:
x<-5;
Assignment of a value works both ways as long as it is the value that is being assigned to the variable:
5->x;
x<-5;
x->5; # this will not work
Seeing what each variable contains is easy. Simply type the name of the variable and press enter
x
## [1] 5
Input can be of any type, including characters, logical etc:
names<-"Christoforos";
status<-FALSE; # a logical value; with quotes ("False") it would be stored as a character
Integer sequences in the form of vectors can be incorporated with the specific operator (:)
x <- 1:20
R can apply all the usual arithmetic operations to simple variables.
x <- 2
y <- x+2 # addition/subtraction
y
## [1] 4
z <- x*y # multiplication
z
## [1] 8
d <- z/4 # division
d
## [1] 2
d**1.5 # power
## [1] 2.828427
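Precedence follows the normal mathematical rules (powers first, then multiplication and division, then addition and subtraction), so use brackets whenever a more complicated formula needs a different order of evaluation or simply becomes hard to read. Note that ** is accepted as a synonym of ^, which is the power operator more commonly seen in R code. Brackets are used in a nested manner like: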
p <- 2*((x-y)**2)-3.14
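As a minimal sketch of how precedence changes a result, the same ingredients can be combined with and without brackets. The variable names a, b and neg below are only illustrative and do not appear elsewhere in this text.

```r
x <- 2
y <- 4
a <- 2*x - y^2      # the power is evaluated first: 4 - 16
b <- (2*x - y)^2    # brackets force the subtraction first: 0^2
neg <- -2^2         # the power binds tighter than the unary minus
a                   # -12
b                   # 0
neg                 # -4
```

When in doubt, adding a pair of brackets costs nothing and makes the intended order explicit.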
R carries a great number of predefined arithmetical functions for basic operations such as square root, logarithms etc
x <- 64
sqrt(x) # square root of x
## [1] 8
y<- -x
y
## [1] -64
abs(y) # absolute value of -64
## [1] 64
y <- x # reset y to 64 so that the logarithm below is defined
log(y) # natural logarithm
## [1] 4.158883
log(y)/log(2) # changing the base to log2
## [1] 6
Now, can you think of a way to get the cube root of a simple variable x?
With data entered only from the command line, we have seen nothing of R’s power yet. R is able to handle great chunks of data at various levels of organization, and the best way to feed them in is to make R read them from a file stored on our computer. There are numerous ways to do so depending on the format, size and type of the data as well as on the downstream analyses we intend to conduct. In the following we will take a look at the most common ones. The most general one is read.table(). Simply invoke it with:
data<-read.table("myfile.txt")
Keep in mind that myfile.txt needs to be in the directory you are currently working in. In any other case you will need the full path of the file such as e.g.
data<-read.table("~/Documents/R/myfile.txt")
R will read through the file, skip any line starting with a comment hash “#” and store the values read in a data frame. We can now check what the data frame holds by asking R to return the first rows using head() or the last ones using tail()
head(data)
tail(data)
If we want to be more specific we can ask R to return only a specific subset of the data frame’s columns, or rows, or even combinations of the two. But let’s leave this for later chapters.

Reading through big files can take time even with R (or especially with R) as it tries to make some inference on the data “on the fly” as the file is being read. R has to work out where the columns are split (whether data come in columns separated by spaces, commas, tabs etc), determine the class of the data in each column and store the whole file in a suitable data type. For all the above reasons, it is important that we make R’s job easier by providing some of that information ourselves. Both read.table and read.delim allow us to provide additional information before reading the file. In particular there are some important options we can set that are related to:

* the column separator: sep
* the header of the columns: header
* the number of rows to read: nrows
Let’s tell R that we want to read a file using tab as the column separator, keeping the first line of the file as column header and reading 1000 rows. We can do this with read.delim()
data<-read.delim("myfile.txt", header=TRUE, sep="\t", nrows=1000)
This will keep the first line as column headers and will stop reading after the first 1000 lines (excluding the header). Columns are split wherever a tab is found. There are a number of ways to separate columns in tables, the most common ones being spaces, tabs and commas. You may often (or not so often) see filenames ending with the *.tsv or *.csv extensions. These indicate tab-separated values and comma-separated values respectively. R has a particular function to read the latter, called read.csv()
data<-read.csv("myfile.csv")
will read and store data directly in columns as long as they are comma separated.

readLines() is a function that reads files line by line, storing each line in a separate vector element. The output is thus a vector holding the lines of the file in the order they appear in it. This may be useful for text mining purposes but not so much when the data are numeric and you want to store them in data frames or matrices.
text<-readLines("file.txt")
text<-readLines("file.txt", 100)
The second command will only read the first 100 lines of the file. Advanced reading can also be performed with specialized functions contained in R libraries. One such library allows the user to import Excel files:
library(xlsx) # more about this later
mydata <- read.xlsx("c:/myexcel.xlsx", 1)
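The reading functions described above can be tried out without any pre-existing file. The following is only a sketch: the file is created in a temporary location with tempfile(), and its contents (the columns id, name and score) are made up for illustration.

```r
# Create a small tab-separated file in a temporary location (illustrative data only)
tmp <- tempfile(fileext = ".tsv")
writeLines(c("# a comment line, skipped by read.table()",
             "id\tname\tscore",
             "1\tanna\t3.5",
             "2\tbob\t4.0",
             "3\tcarl\t2.5"), tmp)

# read.table() is told about the header line and the tab separator
d1 <- read.table(tmp, header = TRUE, sep = "\t")

# the same call, reading only the first two data rows
d2 <- read.table(tmp, header = TRUE, sep = "\t", nrows = 2)

# readLines() returns the raw lines, including the comment line
raw <- readLines(tmp)

str(d1)       # a data frame with 3 rows and 3 columns
dim(d2)       # 2 3
length(raw)   # 5
```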
In many cases, the files fed into R will contain “holes”, incorrectly formatted values or values that cannot be treated numerically. R is not able to understand what you meant with a funny character, or lack thereof, but is “clever” enough to mark the value as “NA” or “NaN”. “NA” signifies a missing value (a hole in the data table) while “NaN” stands for not-a-number and is returned when a mathematical operation is nonsensical (e.g. 0/0; note that dividing a non-zero number by 0 returns Inf or -Inf instead). Be extremely careful with both NA and NaN values as they may either inhibit the execution of certain functions (the good scenario, because you notice the error immediately) or make functions return erroneous values (the bad scenario, because you may not always notice the error). Always remove NA values or at least leave them out of calculations. You can test if a variable has an NA or NaN status simply by asking R:
is.na(x)
is.nan(x)
in which case R will return TRUE or FALSE. More on this in the chapter on Subsetting.
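A short sketch of how the two tests behave, and of how a single missing value affects a calculation. The vector v is made up for illustration.

```r
v <- c(1, NA, 3)
is.na(v)                # FALSE TRUE FALSE
mean(v)                 # NA: the missing value propagates through the calculation
mean(v, na.rm = TRUE)   # 2: the NA is dropped before averaging

is.nan(0/0)             # TRUE: 0/0 is not a number
is.na(0/0)              # TRUE: is.na() also flags NaN
is.nan(NA)              # FALSE: a missing value is not a NaN
1/0                     # Inf, not NaN
```

Many summary functions accept the na.rm argument used above, which is one way of leaving missing values out of a calculation.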
We saw how to feed data into R and how to make sure nonsensical values are not included, but how about getting data out of R and into a file on our computer? R has specific functions for writing data to files, most of which mirror the reading ones. Thus if we want to write output to a file we can make use of the write function:
write(data, file="out.txt")
which can be made much more elaborate if additional options are fed to the command
write(data, file="out.txt", append=TRUE, sep="\t", ncolumns=3)
This will not only write the “data” to a file named “out.txt” but it will also append the data at the end of this file if it already exists. Moreover, data will be written in 3 columns separated by tabs. Keep in mind that write() is meant for vectors and matrices; for data frames it is safer to use write.table instead, which also gives increased control of the process.
write.table(data, file="out.txt", append=TRUE, sep="\t", row.names=FALSE, col.names=TRUE)
This is only to be used on a data frame or a matrix that already has the values spread out in rows and columns (notice that there is no option for the number of columns as these are predefined by the data frame structure). row.names=FALSE tells R to skip writing the row names, since this usually adds an extra column to the output file (try this yourselves). Writing to files can also be done in csv mode with (you might have guessed already) write.csv(). In the same way we can write data to an Excel-readable file with:
write.xlsx(mydata, "c:/mydata.xlsx")
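To see that writing and reading are symmetrical, a small data frame can be written out and read back in. This is a sketch under stated assumptions: the data frame small and the temporary file name are illustrative, and quote=FALSE is an extra write.table option (not discussed above) that leaves character values unquoted in the file.

```r
small <- data.frame(id = 1:3, label = c("a", "b", "c"))
out <- tempfile(fileext = ".txt")

# write a tab-separated file with a header line and no row names
write.table(small, file = out, sep = "\t", row.names = FALSE,
            col.names = TRUE, quote = FALSE)

readLines(out)    # the raw file: a header line followed by three data lines

# read the file back with matching options
back <- read.table(out, header = TRUE, sep = "\t")
dim(back)         # 3 2
```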
As you advance in working with R you may need to store parts of your code (or simply commands) in simple text files and invoke them without having to rewrite them. This can be easily done with the source() function. Simply store the code you want in a file (it helps if it carries the special extension .R) and then call the source function on it like this:
source("file.R")
This will immediately execute all the commands contained in “file.R” (and provide error messages for those that couldn’t be executed). What may these commands be? Carry on to the next section(s).
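A minimal sketch of source() in action, which does not depend on an existing file. The script is written to a temporary location, and the variable names side and area are purely illustrative.

```r
# write a two-line script to a temporary .R file
script <- tempfile(fileext = ".R")
writeLines(c("side <- 4",
             "area <- side^2"), script)

source(script)   # runs both assignments in the current workspace
area             # 16
```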
Data types are important in R. Most, if not all, of its functions are to be executed on specific type(s) of data. It is therefore crucial that you make sure you are using the right one. Pieces of data belong to one of the following five types:

* character (e.g. “Christoforos”, “A33”)
* numeric (real numbers) (e.g. 3.14)
* integer (e.g. 6L; a plain 6 is stored as numeric)
* complex (e.g. -1+4i)
* logical (TRUE/FALSE, which can be abbreviated to T/F)

Data objects may be organized in data classes which may be:

* simple variables
* vectors
* factors
* matrices
* data frames
* lists
Simple variables are data objects holding only one value such as “Christoforos”, 6 or TRUE.

### Vectors

Vectors are enumerated arrays of data in the sense of uni-dimensional matrices, while matrices are…well matrices in the traditional form.

### Factors

Conceptually, factors are categorical vectors, which take on a limited number of different values; such variables are often referred to as categorical variables. One of the most important uses of factors is in statistical modeling. You can read a bit more about them here: http://www.stat.berkeley.edu/~s133/factors.html (but there will be more in the following).

### Data Frames

Data frames (or dataframes) are lists of vectors of equal length but not necessarily of the same class. In this sense they differ from matrices, whose elements must all be of the same type. Data frames are the most versatile (and convenient) way of representing and analyzing data.

### Matrices

Matrices are two-dimensional structures whose elements are all of the same type (most often numeric).

### Lists

Lists are generic vectors containing other objects. Lists do not need to contain vectors of the same size and are thus the most flexible, and the most complex, of the data classes in R.
Data types can be changed into one another with a technique called coercion, with which a value of one data type is forced to become another. This works for some transformations but not all and is to be used with caution. For instance
x<- 1
is a number but
y<-as.character(x)
makes y a character equal to “1” (notice the double quotation marks). This can be changed back to a number with
xx<-as.numeric(y)
which makes xx equal to 1 again. Changes can be performed between numbers and characters as well as between the integers 0, 1 and logicals with as.logical(). Nonetheless, forced coercion is not a very good practice, especially for beginners. Consider yourselves warned.
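A few more coercions, as a sketch of what works and what does not. The variable bad is illustrative, and suppressWarnings() is used only to keep the expected warning out of the way.

```r
as.integer("42")        # character to integer: 42
as.integer(3.9)         # numeric to integer: 3 (truncated, not rounded)
as.logical(c(0, 1))     # FALSE TRUE
as.numeric(TRUE)        # 1

# text that does not look like a number cannot be coerced and becomes NA
bad <- suppressWarnings(as.numeric("abc"))
bad                     # NA
```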
Vectors may be created with the simple function “concatenate” c() or with the use of the vector() function
x <- c(1,2,3)
y <- c("me","you","him")
The vector function is to be used mostly for initialization purposes
z <- vector("numeric", length=20)
This creates a vector of 20 “0” values.
Vectors built from mixed objects are to be treated with extreme caution. This is because R silently coerces all elements to a single common type
y <- c(1.7, "a") # y is now character
y <- c(TRUE, 2) # y is now numeric
We can find out the class of a data object by typing
x <- c(1,2,"TRUE")
class(x)
## [1] "character"
and coerce the data to the type we desire with the corresponding as.*() function
as.logical(x)
## [1] NA NA TRUE
When the coercion makes no sense, R returns the “NA” value. Be prepared to see this a lot if you are not careful with your data assignments.
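One practical use of these NA values is to check a coercion before trusting it. The sketch below assumes a made-up character vector called raw_values, of the kind one might get from a badly formatted file.

```r
raw_values <- c("1.5", "2", "n/a", "4")
num <- suppressWarnings(as.numeric(raw_values))   # "n/a" cannot be converted

num                          # 1.5 2.0 NA 4.0
sum(is.na(num))              # 1: number of values that failed to convert
raw_values[is.na(num)]       # "n/a": the offending entry
```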
Matrices are created with the matrix function. As with vectors, they can be initialized with a given size, but since they are two-dimensional they need both the number of rows and the number of columns.
m <- matrix(0, nrow=2, ncol=5)
creates a 2x5 matrix of zeros. The dimensions of a matrix can be retrieved with the dim function
dim(m)
## [1] 2 5
R fills matrices by completing the columns, starting from the upper left part (element[1,1]). So if you wanted to fill m with the first 10 numbers that would be done by:
m <- matrix(1:10, nrow=2, ncol=5)
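To see the column-wise filling, and how to change it, compare the default with the byrow option of matrix(). The names by_col and by_row are illustrative.

```r
by_col <- matrix(1:10, nrow = 2, ncol = 5)                # default: filled column by column
by_row <- matrix(1:10, nrow = 2, ncol = 5, byrow = TRUE)  # filled row by row

by_col[1, ]   # 1 3 5 7 9
by_row[1, ]   # 1 2 3 4 5
```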
We can also create a matrix from a vector by adding a dimension attribute. The dimensions are in this case a vector themselves
x <- 1:10 # a vector of the numbers 1 to 10
dim(x) <- c(2,5) # dimensions read as number of rows, number of columns
Two very useful functions for matrix manipulation allow us to add rows and columns to an existing matrix, or to create matrices by joining vectors. rbind joins vectors by rows and cbind does the same by columns
x <- 1:10
y <- 11:20
z <- rbind(x,y) # join x and y by treating them as rows of a matrix called z
z <- cbind(x,y) # join x and y by treating them as columns of a matrix called z
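The difference between the two functions is easiest to see from the dimensions of the result. In this sketch the results are stored under the illustrative names by_rows and by_cols instead of z.

```r
x <- 1:10
y <- 11:20
by_rows <- rbind(x, y)   # x and y become the two rows
by_cols <- cbind(x, y)   # x and y become the two columns

dim(by_rows)             # 2 10
dim(by_cols)             # 10 2
rownames(by_rows)        # "x" "y": the vector names are kept as row names
by_rows["y", 3]          # 13: third element of y
```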
Data frames are one of the most common and versatile data types in R. They are tabular lists and so can contain elements of different classes, but they are also matrix-like in the sense that all vectors in the list must be of the same size (length). Data frames are mostly read into R with specific commands (see Reading Data). They have special attributes that refer to the names of the data elements stored in each row (row.names), they can carry titles for the columns etc. Data frames can be converted to matrices by calling the data.matrix() function, but this should be handled with extra care due to the coercion issues covered earlier. When not reading data frames from a file already stored on the computer, we can declare them with commands like:
x<- data.frame(a=1:4, b=c("Me","You","Him","Her"))
x
## a b
## 1 1 Me
## 2 2 You
## 3 3 Him
## 4 4 Her
Notice that the columns have names (“a” and “b”) which were given in the declaration of the variable. Also notice that rows are numbered from 1 to 4. These can be recalled with the row.names() function
row.names(x)
## [1] "1" "2" "3" "4"
The size of the data frame can be given either with dim() or by calling the functions nrow() and ncol()
dim(x) # [1] 4 2
## [1] 4 2
nrow(x) # 4
## [1] 4
ncol(x) # 2
## [1] 2
Data frames are the most commonly used data type, especially when handling external data (files from your computer). We will see more of that later on.
Imagine your data are based on some categorical, non-numeric variable such as “good”/“bad” or “diseased”/“healthy”. In this case you will need a data type that deals with non-numeric data. Factors are special types of vectors that handle categorical (non-numerical) data. Their main difference from plain vectors is that their elements are labeled rather than merely numbered. This means that a factor has “names” such as “Me”, “You” and “Him” instead of numbers such as 1, 2, 3 designated to its elements. In this sense, they are much more useful when trying to address different subsets of the data, a very important aspect of analysis that is called Subsetting. Creating a factor is as easy as:
fac<-factor(c("me", "me", "you", "me", "him", "you"))
Notice that we actually call a function called factor() upon a vector, thus we say we “factorize” a vector. fac now holds the values “me”, “you” and “him”, each in its specific position. The distinct categorical values it holds can be listed with the levels() function
levels(fac)
## [1] "him" "me" "you"
Notice that the levels are returned in alphabetical order. This order is used to assign specific numbers to each factor level. In this scheme, “him” will be given 1, “me” 2 and “you” 3. This can be seen with unclass(), which converts the levels to their attributed numbers.
unclass(fac)
## [1] 2 2 3 2 1 3
## attr(,"levels")
## [1] "him" "me" "you"
table(fac)
## fac
## him me you
## 1 3 2
fac
## [1] me me you me him you
## Levels: him me you
Below the “numerized” factor, R also returns the attributed levels, a sort of legend that tells you which number corresponds to which. Not all factors you’ll be dealing with will be that small though, and so it is useful to have a summary of how the levels are represented in the factor. This is returned by the table() function, which lists the levels ordered by attribute number with the corresponding number of elements below each. This tells us that in our fac factor we had one instance of “him”, three instances of “me” and two of “you”. Remember that if the alphabetical ordering is not suitable or convenient for you, you can always change it by adding a levels option in the factor declaration
fac<-factor(c("me", "me", "you", "me", "him", "you"), levels=c("me","you","him"))
table() will now return the levels in the order that you chose instead of the default one
table(fac)
More on factors in the coming, not so introductory chapters.
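The effect of the levels option can be checked by comparing the numbers behind the two versions of the factor. This is a small sketch; default_fac and custom_fac are illustrative names for the two declarations shown above.

```r
default_fac <- factor(c("me", "me", "you", "me", "him", "you"))
custom_fac  <- factor(c("me", "me", "you", "me", "him", "you"),
                      levels = c("me", "you", "him"))

as.integer(default_fac)    # 2 2 3 2 1 3 (alphabetical: him=1, me=2, you=3)
as.integer(custom_fac)     # 1 1 2 1 3 2 (chosen order: me=1, you=2, him=3)
names(table(custom_fac))   # "me" "you" "him"
```

The values held by the factor are the same in both cases; only the numbering of the levels, and therefore the order in which summaries are reported, changes.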
Lists resemble data frames that do not have to follow the restriction of equal vector size. In this sense you may see them as data “blobs” that can hold simple variables or vectors of any type or size. For the moment it would be useful to know how to create one. Let’s do it step by step, by creating three vectors first:
vec <- c(1,5,7)
mec <- 1:10
dec <- c("TRUE","TRUE","FALSE","FALSE","FALSE")
Now let’s put them all in a list in that order
l<-list(vec,mec,dec)
l now contains vec, mec and dec in this order. This means that if we call back the first element of l (by asking for l[1]) we will be getting the complete vector vec, wrapped in a list of length one
l[1]
## [[1]]
## [1] 1 5 7
Notice the two-lined output R returns containing a “reference” with double brackets that points to our choice for the first element. We can get more than one element by invoking simple subsetting techniques. For instance we can retrieve the first and the third elements of the list by asking for them with a vector containing 1 and 3.
l[c(1,3)]
## [[1]]
## [1] 1 5 7
##
## [[2]]
## [1] "TRUE"  "TRUE"  "FALSE" "FALSE" "FALSE"
which returns vec and dec, the 1st and the 3rd elements (but see more of subsetting later on).
Although a large proportion of R’s built-in functions cannot handle lists directly, they remain important for the organization of data, especially when we are talking about big data. We’ll just leave them aside for the moment and get back to them when the time is ripe.
In case you are wondering, R also supports multi-dimensional (higher-dimensional) data types, called arrays. These are more complex data types of higher order, which we choose to skip discussing for the time being.
All too often we are faced with the problem of not knowing the class of the data we are handling. This is especially troubling in the case of data frames and lists, whose components may be of different classes. The class() function can be called upon any data object and returns its class. In the case of the above list l
class(l)
## [1] "list"
class(l) returns “list” which is exactly what l is. Now what if we wanted to know the data type of each of the objects in l? In this case we may use the str() function
str(l)
## List of 3
## $ : num [1:3] 1 5 7
## $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ : chr [1:5] "TRUE" "TRUE" "FALSE" "FALSE" ...
str(l) returns a much more detailed output that contains the data class of l (List), the number of objects in it (of 3) and the data type of each of the objects (num, int, chr) alongside the first values in each one. Notice how the last object of l has been assigned the type “character” (chr). We can easily coerce it back to logical and put it in the list with
l<-list(vec,mec,as.logical(dec))
str(l)
## List of 3
## $ : num [1:3] 1 5 7
## $ : int [1:10] 1 2 3 4 5 6 7 8 9 10
## $ : logi [1:5] TRUE TRUE FALSE FALSE FALSE
Subsetting refers to the selection of parts of data from greater sets. Subsetting is one very important aspect of the R environment in the sense that it can be performed with extreme precision and at great speed. In this sense it constitutes one of R’s main advantages. Subsetting uses a number of special characters to perform various tasks, such as obtaining specific rows, columns or elements from all possible data types depending on the user’s choice. It can be roughly divided into:

* structural subsetting, where data are subsetted based on the structure of the data type (e.g. the 3 first columns)
* logical subsetting, where data are subsetted based on a logical restriction (e.g. all values that are not “NA”)
* numerical subsetting, where data are subsetted based on numerical/categorical operation/control (e.g. all values >10 or all values equal to “FALSE”)
Structural subsetting makes use of the [], [[]] and $ operators, which combined cleverly with commas can provide absolute precision in the choice of data with only a few characters of code. [] may be used on any vector, factor, matrix or data frame to subset it in one or two dimensions. In the case of vectors and factors there is only one dimension. Therefore, if we want the nth element of a vector v we simply put n within brackets
v<-1:10
x<-v[6]
x now holds the sixth value of the vector v. R enumerates all data types starting from 1 (and not 0 like Perl or Python) so 6 will actually return the 6th value.
x<-v[6:8]
will get a “slice” of v and store it in x which now becomes a vector itself carrying the 6th, 7th and 8th elements of v. Remember how the “:” operator is used for ordered integers. Now what if we wanted some compartmentalized subsetting that does not follow a certain order?
v<-1:20 # v is extended so that all the indices below exist
x<-v[c(1, 6:8, 11:15, 18, 20)]
This intricate subsetting allows us to get the 1st, then the 6th-8th, then the 11th-15th, then the 18th and then the 20th elements of v and store them in a vector called x. Notice how the subsetting indices (the numbers in the brackets) are a vector in themselves and thus they are introduced with the concatenate function c(). We could have greater control in this subsetting if we split the process in two
m<-1:100
ind<-c(1, 6:8, 11:15, 18, 20)
x<-m[ind]
This first creates a vector called ind that carries the indices (the positions of the elements we want to obtain from m) and then passes it to m with [] to perform the subsetting. The exact same process stands for factors (which as we already know are categorical vectors).

But what about matrices and data frames? Here there are two dimensions on which we can subset (rows and columns). R uses the same operator [] but allows for two values separated by a comma to provide information on rows and columns (in this order). Although both of them are not always needed (suppose you only want to subset columns but not rows), R needs to know we are treating two-dimensional data types so we need to use a comma in between. This will become less confusing with an example. Suppose we need to keep only the first and the third column from a matrix m. This is done with:
m<-mtcars
mm <- m[,c(1,3)]
Notice that the indices inside the brackets are of the form [ ,vector]. That is because the value before the comma is reserved for row subsetting. Since we don’t want to subset on the rows we leave this empty, but use the comma since this is compulsory. After the comma we simply provide the vector of indices we want to subset columns by (in this case c(1,3) for the 1st and the 3rd). In perfect symmetry subsetting on rows 10 through 20 would be performed with:
m<-mtcars
mm <- m[10:20,]
As in this case the indices are consecutive we need not use a concatenated structure so 10:20 will do. Alternatively we could use c(10,11,12,13,14,15,16,17,18,19,20) but we don’t for obvious reasons. Bear in mind that in both cases above mm is a matrix (or data frame, depending on what m was) whose dimensions have now changed. If m was an MxN matrix, then mm is Mx2 in the first case and 11xN in the second (rows 10 through 20 are eleven rows). In the case we subset on both rows and columns
m<-matrix(1:140, nrow=10, ncol=14)
mm <- m[c(1:5,8), c(10,11, 12:14)]
mm is now a 6x5 matrix. Understandably we can return any single element of a matrix or data frame by providing its exact “coordinates” and thus
m[6,9]
## [1] 86
will return the 6th element of the 9th column. In the case of data frames with named columns we can also use the $ operator to subset columns. Take the built-in R data frame called mtcars simply by typing
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
This small data frame contains models of cars alongside their manufacturer specifications. In order to choose one specific specification simply ask for the name of the data frame followed by $ and the name of the column. For instance
mtcars$cyl
## [1] 6 6 4 6 8 6 8 4 4 6 6 8 8 8 8 8 8 4 4 4 4 8 8 8 8 4 4 4 8 6 8 4
will return the vector containing the cyl column. This is identical to calling mtcars[,2], asking thus for the second column of the data frame (the names of the cars are row names and do not count as a column).

The [[]] and $ operators are mostly used in list context. Although, as we saw earlier, [n] can be used to invoke the nth element of a list, [[n]] does the same without getting back the reference (the name) of that element. This is sometimes desirable, especially when we want to pass data from a list to another function. Consider the list l we saw earlier
l<-list(vec,mec,dec)
where vec, mec and dec are three vectors of different type and size. l can also be created by giving a name to each of the three
l<-list(a=vec, b=mec, c=dec)
Now vec can be invoked with all of the following commands
l[1]
## $a
## [1] 1 5 7
l[[1]]
## [1] 1 5 7
l$a
## [1] 1 5 7
There are subtle differences in the form of the output. From top to bottom the output is stripped of (sometimes unnecessary) references.
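The difference matters as soon as the element is handed to another function. A short sketch, using an illustrative list called lst so as not to overwrite l:

```r
lst <- list(a = c(1, 5, 7), b = 1:10)

class(lst[1])      # "list": still a list, of length one
class(lst[[1]])    # "numeric": the vector itself
mean(lst[[1]])     # 4.333333
sum(lst$b)         # 55
```

Functions such as mean() and sum() expect the vector itself, so [[ ]] or $ is the form to use when passing list elements on.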
Logical subsetting makes use of logical operators (“&”, “|”, “!”; see more on that in Control Structures). These stand for AND (“&”), OR (“|”) and NOT (“!”). While the first two are more complex “joining” operators that we will see more of when we discuss control structures, NOT “!” can stand on its own as it signifies the negation of a statement. In this sense
!is.na(x)
## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
returns a logical value (TRUE or FALSE) for each element of x, depending on whether that element is NOT “NA”. In this sense, !is.na() is the mirror-image of is.na(). Let’s use this in a subset
y<- x[!is.na(x)]
y is now a vector with all the elements of x that are NOT “NA”. There are two things to be careful of here. One, that the subsetting index refers to the subsetted data object (x). Notice how x is subsetted based on one of its own properties (how many of its elements are not “NA”). The other is that the output is a vector, regardless of the structure of x. Even if x was to be a matrix or a data frame, the logical subsetting will return the values reading by column starting from the top left.
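Both points can be seen in a short sketch. The vector obs and the matrix mat are made up for illustration.

```r
obs <- c(4, NA, 7, NA, 1)
obs[!is.na(obs)]     # 4 7 1

# on a matrix the result is a plain vector, read column by column
mat <- matrix(c(1, NA, 3, 4), nrow = 2)
mat[!is.na(mat)]     # 1 3 4
```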
Numerical subsetting makes use of the common numerical operators (>, <, >=, <=, != and ==). You should already be familiar with “>”, “<”, “<=” and “>=” as “greater”, “smaller”, “smaller or equal” and “greater or equal”, but notice “!=” signifying “not equal to” and “==” for “equal to” (this is because in R, as in most programming languages, “=” is used for assignment of variable values; in R “<-” and “=” are largely interchangeable for assignment but we strongly recommend “<-” to avoid confusion). Numerical subsetting works exactly like logical subsetting. So assuming x is a data frame holding these values
x<- data.frame(a=1:4, b=c("Me","You","Him","Her"), stringsAsFactors=TRUE) # column b is stored as a factor (the default before R 4.0.0)
Then, subsetting it numerically by asking only values equal to 2 would be
x[(x==2)]
## [1] "2"
This returns the value “2” (only once in this case, but more times if 2 were to be found more than once). What if we wanted to get values that are greater than 2? We could try
x[(x>2)]
## Warning in Ops.factor(left, right): '>' not meaningful for factors
## [1] "3" "4" NA NA NA NA
What happened here? We got the two numerical values that fulfill our restriction (>2) but we also got four NA values at the end and a warning message saying that our operation is not meaningful for factors. What R is trying to tell us is that a comparison >2 is not possible for categorical data like “Me” or “You”. In this sense the operation “Me”>2 on a factor returns a missing value (NA). R prints these at the end of the vector but is kind enough to point the problem out to us.

Numerical subsetting is not always numerical. It can be categorical as well with the use of the “==” and the “!=” operators. If we try
x[(x!="Me")]
## [1] "1" "2" "3" "4" "You" "Him" "Her"
we get back all the elements of x that are not equal to “Me”, regardless of whether they are numbers or characters. This is because R coerces variables to a common type wherever this is possible before conducting the subsetting (notice that the numbers come back as character strings). This is very handy in the general cases of data comparisons that we discuss next in Control Structures. One very handy function for subsetting is which(). It can be used for both logical and numerical subsetting to produce the indices of the elements that fulfill certain conditions.
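x <- c(1, -1, 0, 3, -20, -2)
# positions of the elements of x that are >= 1
ind <- which(x >= 1)
# use those positions to extract the values themselves
new_x <- x[ind]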
which() can also be used in a logical context: for instance, which(is.na(x)) returns the positions of the missing values in x.
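As a small sketch of which() in a logical context, assume a made-up vector z with some missing values (z, na_pos and ok_pos are names chosen for this illustration only):

```r
# Hypothetical vector with missing values
z <- c(10, NA, 3, NA, 8)

# which() converts a logical vector into the positions of its TRUE elements
na_pos <- which(is.na(z))     # positions of the missing values
ok_pos <- which(!is.na(z))    # positions of the observed values

# Indexing by position gives the same values as indexing by the logical vector
same <- identical(z[ok_pos], z[!is.na(z)])

print(na_pos)
print(ok_pos)
print(same)
```

Positions are often more convenient than a logical vector when you need to report where something happens, not just extract it.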
Let's now use subsetting in an example that is very useful (and also quite common): removing NA values from a data frame. The task we want to complete is, given a data frame (or a matrix), to extract the instances (rows) that do not have NA values (holes). As in many cases, R already has a built-in function to check for rows in a matrix that do not carry NA values. This function is complete.cases() and can be invoked on a matrix m like this
ind<-complete.cases(m)
ind is now a logical vector with one element per row of m: TRUE for the rows that fulfill the condition (not having any NA value) and FALSE for the rest. All we have to do now is to ask for the subset of m whose rows are flagged TRUE in ind
mm<-m[ind,]
And we are done! A clean data set.
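A minimal, self-contained sketch of the steps above, assuming a small made-up matrix in place of real data (the values are invented for illustration):

```r
# Hypothetical 4 x 3 matrix; matrix() fills it column by column,
# so rows 2 and 3 each end up with one NA
m <- matrix(c(1, 2, NA, 4,
              5, 6, 7, 8,
              9, NA, 11, 12), nrow = 4)

# One logical value per row: TRUE when the row has no NA at all
ind <- complete.cases(m)

# Keep only the complete rows; drop = FALSE keeps the result a matrix
# even if a single row survives
mm <- m[ind, , drop = FALSE]

print(ind)
print(mm)
```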
A number of things in R can be done with the use of predefined functions. Subsetting is no exception. There is a specific subsetting function called subset(). subset() combines the use of all types of subsetting (numerical, logical and structural) in data frames with named columns. It also makes use of logical operators to combine several subsetting conditions in a single call. An example may be seen with a default R dataset called “airquality”. “airquality” is a data frame holding information on ozone, solar radiation, wind and temperature for a number of dates organized by month and day. To have a better view of its contents simply type
head(airquality)
## Ozone Solar.R Wind Temp Month Day
## 1 41 190 7.4 67 5 1
## 2 36 118 8.0 72 5 2
## 3 12 149 12.6 74 5 3
## 4 18 313 11.5 62 5 4
## 5 NA NA 14.3 56 5 5
## 6 28 NA 14.9 66 5 6
Notice how there are holes in the data with a number of NA values. Suppose now that we want to obtain a slice of the data that contains all dates with a temperature higher than 60 degrees (Fahrenheit) and a wind of 10 miles per hour or more. We would type
dates<-subset(airquality, Temp > 60 & Wind>=10)
Observe how the function works. We call subset() on the data frame airquality asking that a combined condition is fulfilled, namely Temp>60 AND Wind>=10. The logical “AND” is coded by the ampersand “&” symbol. Alternatively, had we wanted to keep the dates with either Temp>60 OR Wind>=10 we would have asked for
dates<-subset(airquality, Temp > 60 | Wind>=10)
in which case the logical “OR” is coded with the bar symbol “|”. Think about the case where we would have wanted one condition to hold but not the other (e.g. Temp>60 “AND NOT” Wind>=10). In this case we can either negate the second condition with “!”, as in Temp>60 & !(Wind>=10), or code an equivalent condition by inverting it ourselves. That would be Temp>60 & Wind<10 (inverting the condition on wind). Finally, subset() can also incorporate structural subsetting in the form of retaining specific parts of the data. If, for example, we wanted to keep only the dates (that is month and day) fulfilling the above condition (that is, dropping the meteorological data), all we need to do is to use the select argument of subset(). The command would now be:
dates<-subset(airquality, Temp > 60 & Wind<10, select = c(Day, Month))
head(dates)
## Day Month
## 1 1 5
## 2 2 5
## 7 7 5
## 10 10 5
## 11 11 5
## 12 12 5
Notice how we have complete freedom to manipulate the data frame in terms of the ordering of its columns. In this example we chose to show Day before Month although their order was the other way round in the initial data frame.
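For comparison, here is a sketch of how the same selection can be written with plain bracket subsetting, and how the “AND NOT” condition can be written directly with “!”. The object names by_subset, by_bracket and by_not are chosen for this illustration only:

```r
# airquality ships with base R (package 'datasets')
by_subset <- subset(airquality, Temp > 60 & Wind < 10, select = c(Day, Month))

# The same rows and columns requested with bracket indexing:
# a logical row index plus a vector of column names
rows <- airquality$Temp > 60 & airquality$Wind < 10
by_bracket <- airquality[rows, c("Day", "Month")]

# "AND NOT" written directly with the negation operator
by_not <- subset(airquality, Temp > 60 & !(Wind >= 10), select = c(Day, Month))

print(head(by_bracket))
print(isTRUE(all.equal(by_subset, by_bracket)))
print(isTRUE(all.equal(by_subset, by_not)))
```

The bracket form needs the data frame name repeated for every column, which is the main convenience subset() removes. Note that the two forms treat missing values in the condition differently: subset() drops rows where the condition is NA, while bracket indexing returns them as NA rows. Here Temp and Wind have no missing values, so the results agree.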
One of R’s main strengths is the great number of predefined functions. Taken together, the set of built-in functions alongside those contained in various R packages constitutes an extensive toolbox with which you can perform mathematical calculations, conduct complex statistical analysis and create elegant and highly informative graphs. A detailed summary of the complete “dictionary” of R functions is beyond the scope of these notes (and may well be beyond the scope of most R manuals). Nonetheless you can find the index of R functions contained in the base “core” distribution here.
For the purposes of our classes we will discuss only a subset of all available functions, focusing on basic mathematical operations, functions that deal with basic (and a little more advanced) statistics and plotting functions.
We have already seen how mathematical calculations may be performed in R in more or less the way one uses a calculator. Remember that we can simply type numbers or variables with operators in between and get the result by hitting return (enter)
a<-5
b<-6
a + b
## [1] 11
The basic mathematical operators are recapped in the list below
Arithmetic Operators

* + addition
* - subtraction
* * multiplication
* / division
* ^ or ** exponentiation
* x %% y modulus (x mod y), e.g. 5%%2 is 1
* x %/% y integer division, e.g. 5%/%2 is 2

Out of those you may not be familiar with the last two. Modulus provides the remainder of a division between two integers, while integer division is the result of a division without proceeding further than the integral part of the quotient.
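A short sketch of how modulus and integer division fit together, using made-up numbers (the names a, b, q, r and evens are chosen for this example only):

```r
a <- 17
b <- 5

q <- a %/% b    # integer division: how many whole times b fits in a
r <- a %% b     # modulus: what is left over

# The two results always rebuild the original number
rebuilt <- b * q + r

# A common use of the modulus: picking out the even numbers
n <- 1:6
evens <- n[n %% 2 == 0]

print(c(quotient = q, remainder = r, rebuilt = rebuilt))
print(evens)
```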
Logical operators have been discussed previously in the chapters related to Subsetting. A formal definition of logical operators (or connectors, or connectives) is a set of symbols that may be used to connect two or more statements in a grammatically valid way. In the example discussed above for the subset() function these two statements would be: a) “The temperature is greater than 60 degrees” and b) “The wind speed is greater than or equal to 10 miles per hour”. Each of the two statements contains a comparison operator in itself: Temp>60 and Wind>=10 are examples of numerical comparison operators. Combinations thereof consist of asking for:

* Both of them being true (A AND B)
* Only one of them being true (A AND NOT B, B AND NOT A)
* At least one of them being true (A OR B)

All of the above can be coded with the specific operators listed below.

* < less than
* <= less than or equal to
* > greater than
* >= greater than or equal to
* == exactly equal to
* != not equal to
* !x NOT x
* x | y x OR y
* x & y x AND y

In the following chapters we will see how to go about using them, so as to get used to working with them on a routine basis.
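The combinations listed above can be sketched on two small made-up vectors (temp and wind are hypothetical values, not taken from airquality). Note that “&”, “|” and “!” work element by element; xor() is a base R function for “exactly one of the two”:

```r
# Hypothetical measurements for four days
temp <- c(58, 65, 72, 61)
wind <- c(12, 8, 11, 15)

A <- temp > 60     # statement A, one TRUE/FALSE per day
B <- wind >= 10    # statement B, one TRUE/FALSE per day

both        <- A & B       # A AND B
either      <- A | B       # A OR B (at least one is true)
only_a      <- A & !B      # A AND NOT B
exactly_one <- xor(A, B)   # one of the two is true, but not both

print(rbind(A, B, both, either, only_a, exactly_one))
```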
Let’s now get to the core of the basic R function toolbox. As discussed previously, R has a function for almost everything. A number of these functions are very useful for getting beginners accustomed to the way R works and for preparing them to write their own functions (when they stop being beginners).
Below we present some of the most important R functions for various objectives covering basic statistics, plotting and string manipulation.
Numeric functions include functions used to perform more advanced mathematical operations. Some of the most important are:
abs(x) : absolute value
abs() is used to return the absolute value of a number (integer or not). This means that both abs(-5) and abs(5) return the value 5.
sqrt(x) : square root
sqrt() returns the square root of a number.
ceiling(x), floor(x), trunc(x), round(x, digits=n), signif(x, digits=n)
All of these functions round real numbers (that is, they omit decimal digits). ceiling(x) rounds the real number x up to the smallest integer that is not smaller than x (>=x) while floor(x) does the opposite, rounding x down to the largest integer that is not greater than x (<=x). In this sense
ceiling(5.1)
## [1] 6
returns 6, while
floor(5.98)
## [1] 5
returns 5
trunc(x) drops the decimal part of the number, that is, it rounds toward zero. For positive numbers this gives the same result as floor(), while for negative numbers it gives the same result as ceiling(). round() and signif() both take a number of digits as an additional argument but differ in that round(x, digits=n) keeps n decimal places while signif(x, digits=n) keeps n significant digits in total (the decimal point is not counted). Thus:
x<-45.6789
round(x, digits=3)
## [1] 45.679
signif(x, digits=3)
## [1] 45.7
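The difference between floor(), ceiling() and trunc() only shows up for negative numbers. A minimal sketch with made-up values (vals is a name chosen for this example):

```r
# Two negative and two positive values
vals <- c(-5.98, -5.1, 5.1, 5.98)

down <- floor(vals)     # always toward minus infinity: -6 -6  5  5
up   <- ceiling(vals)   # always toward plus infinity:  -5 -5  6  6
zero <- trunc(vals)     # always toward zero:           -5 -5  5  5

print(rbind(vals, down, up, zero))
```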
cos(x), sin(x), tan(x), acos(x), cosh(x), acosh(x)
Refer to the corresponding trigonometric functions for cosine, sine, tangent, arc-cosine, hyperbolic cosine, inverse hyperbolic cosine etc. Angles are expressed in radians.
log(x) : natural logarithm, log10(x) : common logarithm
log() is the natural logarithm while the base-10 log is coded as log10(). Do you remember what you need to do to convert a natural log to, say, a base-2 logarithm? You divide by the natural log of the new base, that is log(x)/log(2). In case you don’t remember how to change the base you can always use logb(x, base=n)
exp(x) : exponential function (e raised to the power x)
Remember that generic exponentiation is coded either through a^x or a**x.
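As a quick check of the change of base, here is a sketch computing a base-2 logarithm in several ways. All the functions used (log(), logb(), log2()) are part of base R; the value 8 is just an example:

```r
x <- 8

by_hand <- log(x) / log(2)     # change of base done manually
with_logb <- logb(x, base = 2) # base given as an argument
with_log <- log(x, base = 2)   # log() accepts a base argument too
with_log2 <- log2(x)           # dedicated base-2 function

print(c(by_hand, with_logb, with_log, with_log2))   # all (approximately) 3
```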
Character handling is not one of R’s major strengths and you can always manipulate character strings outside R with shell scripting or other scripting languages. Still, R provides a number of functions we can use to handle strings when we need to stay within the environment.
substr(x, start=n, stop=m)
The substr() function works more or less like the function of the same name in Perl. In a string x it returns the substring starting at position n and running through position m (both included). Remember that numbering in R starts from 1.
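A brief sketch of substr() on a made-up string (s is a name chosen for this example; nchar() is the base R function that returns the length of a string):

```r
s <- "statistics"

first4 <- substr(s, start = 1, stop = 4)   # characters 1 through 4
rest <- substr(s, 5, nchar(s))             # from character 5 to the end

print(first4)   # "stat"
print(rest)     # "istics"
```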
grep(pattern, x , ignore.case=FALSE, fixed=FALSE)
Pattern matching in R is performed with grep(), which searches for pattern in each element of x. With fixed=FALSE (the default) the pattern is interpreted as a regular expression, while with fixed=TRUE it is matched as a literal string. Notice that the function returns the indices of the elements of the vector x that match the pattern, not the elements themselves (use value=TRUE to get the matching elements instead). The substitution function sub() is similar to grep() but with the addition of a replacement string:
sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)
sub() replaces only the first match in each element of x; gsub() takes the same arguments and replaces all matches. For instance
sub("\\s",".","Hello There")
## [1] "Hello.There"
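The following sketch shows what grep() returns and how sub() differs from gsub(). The vector words is made up for this example:

```r
# Hypothetical character vector
words <- c("apple", "Banana", "cherry", "banana split")

idx <- grep("banana", words)                         # indices, case-sensitive
idx_ic <- grep("banana", words, ignore.case = TRUE)  # indices, ignoring case
vals <- grep("an", words, value = TRUE)              # matching elements instead

first_only <- sub("a", "A", "banana")    # replaces the first match only
every_one <- gsub("a", "A", "banana")    # replaces every match

print(idx)          # 4
print(idx_ic)       # 2 4
print(vals)         # "Banana" "banana split"
print(first_only)   # "bAnana"
print(every_one)    # "bAnAnA"
```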
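Splitting of a string is performed with strsplit()
strsplit(x, split)
where split is the character(s) at which splitting takes place. strsplit() returns a list with one character vector for each element of x. By default split is interpreted as a regular expression; fixed=TRUE makes it a literal string, which is what we need here because “.” as a regular expression matches any character. For instance:
x <- "abc.d.eef.g"
# split at every literal dot
d <- strsplit(x, ".", fixed=TRUE)
d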
## [[1]]
## [1] "abc" "d" "eef" "g"
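To see why the fixed argument matters, here is a sketch comparing three ways of splitting the same string, plus a two-element input to show the list structure. The variable names are chosen for this example only:

```r
x <- "abc.d.eef.g"

literal <- strsplit(x, ".", fixed = TRUE)[[1]]   # "." taken literally
escaped <- strsplit(x, "\\.")[[1]]               # regular expression, dot escaped
any_char <- strsplit(x, ".")[[1]]                # regular expression: "." matches
                                                 # every character, so only
                                                 # empty strings are left

# With more than one input string the result has one list element per string
two <- strsplit(c("a-b", "c-d-e"), "-", fixed = TRUE)

print(literal)
print(escaped)
print(any_char)
print(two)
```

Because the result is a list, [[1]] is needed to pull out the character vector for the first (here the only) input string.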
The paste() function joins strings in a concatenation using sep as a separator
paste("x",1:3,sep="")
## [1] "x1" "x2" "x3"
paste("x",1:3,sep="M")
## [1] "xM1" "xM2" "xM3"
Transforming strings to upper or lower case is explicitly done with toupper(x) and tolower(x).
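A small sketch combining these functions. Note that the collapse argument of paste() is not discussed above; it is a standard base R argument that glues the elements of a vector into one single string:

```r
labels <- paste("x", 1:3, sep = "_")    # one string per number
formula_like <- paste(labels, collapse = "+")   # a single string

upper <- toupper(formula_like)
lower <- tolower("Hello There")

print(labels)         # "x_1" "x_2" "x_3"
print(formula_like)   # "x_1+x_2+x_3"
print(upper)          # "X_1+X_2+X_3"
print(lower)          # "hello there"
```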
A number of very useful functions do not fall in a specific category but will prove very handy once you start coding your own R scripts and functions. Suppose you need to generate a sequence of numbers with a fixed interval. Remember that this can be done with the “:” operator only for interval=1, but if we want another step we may use
seq(from, to, by)
If you code
x <- seq(1,20,3)
x
## [1] 1 4 7 10 13 16 19
x becomes the vector of values that start at from and increase in steps of by, up to the largest value that does not exceed to (here 19, since the next value, 22, would be greater than 20). Another type of sequence we may need to create is the repetition of elements. This is performed with the use of the rep() function
rep(x, times)
y <- rep(1:3, 2)
y
## [1] 1 2 3 1 2 3
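A few variations on seq() and rep(), as a sketch. The arguments length.out and each are not discussed above; they are standard base R arguments of seq() and rep() respectively:

```r
quarters <- seq(0, 1, by = 0.25)           # fixed step: 0, 0.25, 0.5, 0.75, 1
countdown <- seq(10, 1, by = -3)           # negative step: 10, 7, 4, 1
five_pts <- seq(1, 20, length.out = 5)     # fixed number of values instead of step

cycled <- rep(1:3, times = 2)              # whole vector repeated: 1 2 3 1 2 3
doubled <- rep(1:3, each = 2)              # each element repeated: 1 1 2 2 3 3

print(quarters)
print(countdown)
print(five_pts)
print(cycled)
print(doubled)
```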
Obtaining a random set of values from a greater set is done with sample().
sample(x, size, replace=FALSE)
sample() draws size elements at random from the vector x and returns them as a vector. If replace=TRUE then the same element can be drawn more than once.
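A minimal sketch of sampling with and without replacement. The seed value is arbitrary and is set only so that the same draw can be repeated; the exact numbers you get depend on your R version, so none are shown here:

```r
set.seed(42)    # arbitrary seed, makes the random draws repeatable

pool <- 1:10

# Without replacement: at most length(pool) values, no repeats possible
s1 <- sample(pool, size = 4, replace = FALSE)

# With replacement: we may draw more values than the pool holds,
# so repeats are unavoidable here (15 draws from 10 values)
s2 <- sample(pool, size = 15, replace = TRUE)

print(s1)
print(s2)
print(anyDuplicated(s1) == 0)   # TRUE: no value drawn twice
```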
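pretty(c(start,end), N) returns a vector of equally spaced “round” values that cover the range from start to end, splitting it into approximately N intervals (so the result usually has about N+1 values, and N is a target rather than a guarantee). Invaluable for simulating distributions, producing values for standard reference plots etc.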
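A short sketch of pretty() on two made-up ranges. The second range has awkward end points, to show that the returned values are rounded outward so that they still cover it:

```r
p1 <- pretty(c(0, 10), 5)         # a tidy range
p2 <- pretty(c(0.13, 9.87), 5)    # an untidy range

print(p1)
print(p2)

# The returned values are equally spaced and enclose the requested range
print(range(p2))
print(diff(p2))
```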
sort(x) returns the vector x sorted in numerical order from the smallest to the largest element while
order(m[,i], m[,j]) returns the row indices that put a matrix m in order, first according to the values in the i-th column and then (to break ties) according to the values in the j-th column. Calling
m<-matrix(runif(100), nrow=50, ncol=2)
i<-1; j<-2
m[order(m[,i],m[,j]),]
## [,1] [,2]
## [1,] 0.01941539 0.34819635
## [2,] 0.02154932 0.26899819
## [3,] 0.02386933 0.89184715
## [4,] 0.05509549 0.24741497
## [5,] 0.07791300 0.49384672
## [6,] 0.10828620 0.88483344
## [7,] 0.12851299 0.07585098
## [8,] 0.12916345 0.10612936
## [9,] 0.15940345 0.78777651
## [10,] 0.20370307 0.58113488
## [11,] 0.20548575 0.70642089
## [12,] 0.21469433 0.77918161
## [13,] 0.25848980 0.29863826
## [14,] 0.26787334 0.87951580
## [15,] 0.29316299 0.08517458
## [16,] 0.30864109 0.80778152
## [17,] 0.32752031 0.05264705
## [18,] 0.32921001 0.63317854
## [19,] 0.43851530 0.48556729
## [20,] 0.46081012 0.84923816
## [21,] 0.46329310 0.74915763
## [22,] 0.47354667 0.82944385
## [23,] 0.48784083 0.95468440
## [24,] 0.49316508 0.66721321
## [25,] 0.49915382 0.75940082
## [26,] 0.54052345 0.74446531
## [27,] 0.56173580 0.55928891
## [28,] 0.56474386 0.24409950
## [29,] 0.62737051 0.28184866
## [30,] 0.62902509 0.22230040
## [31,] 0.65773682 0.97934441
## [32,] 0.75524459 0.74239171
## [33,] 0.77329347 0.47539288
## [34,] 0.78628321 0.69725261
## [35,] 0.85403166 0.64415563
## [36,] 0.87482882 0.58901165
## [37,] 0.88178353 0.32868077
## [38,] 0.88684730 0.58076555
## [39,] 0.89527528 0.61890956
## [40,] 0.92542161 0.11068735
## [41,] 0.92963014 0.46919044
## [42,] 0.94620639 0.63236670
## [43,] 0.95125025 0.29532427
## [44,] 0.95840734 0.14516101
## [45,] 0.96348086 0.02117248
## [46,] 0.96420802 0.31375356
## [47,] 0.97001999 0.93473472
## [48,] 0.97587025 0.04801724
## [49,] 0.97940268 0.03968395
## [50,] 0.98872487 0.14424190
will produce a re-ordered matrix according to the above ordering. (More about the very useful runif() function later on)
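With random numbers ties in the first column are practically impossible, so the role of the second column is hard to see. A sketch with a small made-up matrix (m_small is a hypothetical name) makes it visible:

```r
# Hypothetical matrix with repeated values in the first column
# (filled column by column: column 1 is 2,1,2,1 and column 2 is 9,5,3,7)
m_small <- matrix(c(2, 1, 2, 1,
                    9, 5, 3, 7), ncol = 2)

# Row order: by column 1 first, ties broken by column 2
ord <- order(m_small[, 1], m_small[, 2])
sorted <- m_small[ord, ]

# Negating a numeric column reverses its order: column 1 ascending,
# column 2 descending within ties
ord_desc <- order(m_small[, 1], -m_small[, 2])

print(ord)        # 2 4 3 1
print(sorted)
print(ord_desc)   # 4 2 1 3
```

Note that order() returns positions, not values, which is why it is used inside the brackets to rearrange the rows.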