Installing Bioconductor

Bioconductor can be easily installed with the following code:

source("https://bioconductor.org/biocLite.R")
biocLite()

The first line downloads a scrip containing the function biocLite, while the latter will download and install the core set of bioconductor packages.

Preprocessing CEL files with the affy package

A CEL file contains a binary representation of the fluorescent levels of a single microarray. In this lesson we will use 6 CEL files from the following study:

Nair RP et al., Genome-wide scan reveals association of psoriasis with IL-23 and NF-kappaB pathways. Nat Genet. 2009 Feb;41(2):199-204. doi: 10.1038/ng.311

Three CEL files were profiled from biopsies taken in lesional skin, while the other ones from biopsies taken in healthy tissue. The easiest way for loading CEL files in R is through the **affy* (standing for “affymetrix”) package:

#loading the library
library(affy, quietly = TRUE)
## 
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
## 
##     clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
##     clusterExport, clusterMap, parApply, parCapply, parLapply,
##     parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
## 
##     IQR, mad, xtabs
## The following objects are masked from 'package:base':
## 
##     anyDuplicated, append, as.data.frame, cbind, colnames,
##     do.call, duplicated, eval, evalq, Filter, Find, get, grep,
##     grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
##     match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
##     Position, rank, rbind, Reduce, rownames, sapply, setdiff,
##     sort, table, tapply, union, unique, unsplit, which, which.max,
##     which.min
## Welcome to Bioconductor
## 
##     Vignettes contain introductory material; view with
##     'browseVignettes()'. To cite Bioconductor, see
##     'citation("Biobase")', and for packages 'citation("pkgname")'.
#files names
fileList <- dir('CEL/', '.CEL', full.names = FALSE)

#sample information
pheno <- data.frame(group =c(rep('healthy', 3), rep('lesion', 3)), 
                    stringsAsFactors = FALSE);
rownames(pheno) <- fileList

#reading the files
affyBatch <- ReadAffy(celfile.path = 'CEL', phenoData = pheno);

The affyBatch object now contains two types of information:

  1. the actual raw microarray data
  2. a data frame specifying the characteristics of each sample

Notice that the latter may contain more than one column, if needed.

We can now apply the RMA algorithm on the samples:

#RMA 
expressionSet <- rma(affyBatch, normalize = TRUE, background = TRUE);
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail'
## when loading 'hgu133plus2cdf'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head'
## when loading 'hgu133plus2cdf'
## 
## Background correcting
## Normalizing
## Calculating Expression

The expressionSet object is similar to affyBatch, but it contains the processed data instead of the raw ones. Intestingly, expressionSets behave much as data frames / matrices:

#expressionSet dimensions 
dim(expressionSet)
## Features  Samples 
##    54675        6
#columns and rows names
colnames(expressionSet)
## [1] "GSM337204.CEL.gz" "GSM337205.CEL.gz" "GSM337206.CEL.gz"
## [4] "GSM337367.CEL.gz" "GSM337368.CEL.gz" "GSM337369.CEL.gz"
head(rownames(expressionSet))
## [1] "1007_s_at" "1053_at"   "117_at"    "121_at"    "1255_g_at" "1294_at"
#subsetting
dim(expressionSet[1:100, 1:4])
## Features  Samples 
##      100        4

Moreover, is it possible to extract and update both the expression and the phenotype data:

#expression data
dataMatrix <- exprs(expressionSet)
dataMatrix[1:5, 1:5]
##           GSM337204.CEL.gz GSM337205.CEL.gz GSM337206.CEL.gz
## 1007_s_at        11.038991        10.439639        10.740932
## 1053_at           7.548828         7.882609         7.701032
## 117_at            6.960496         7.608002         7.328793
## 121_at            8.203228         7.936755         8.261642
## 1255_g_at         3.763614         3.914914         4.150167
##           GSM337367.CEL.gz GSM337368.CEL.gz
## 1007_s_at        10.057551        10.504319
## 1053_at           8.432856         7.906931
## 117_at            7.335760         7.140861
## 121_at            7.521465         7.699498
## 1255_g_at         3.849545         3.994785
#phenotype data
pheno <- pData(expressionSet)
head(phenoData)
##                                                                                                                                                                                                                                        
## 1 structure(function (object)                                                                                                                                                                                                          
## 2 standardGeneric("phenoData"), generic = structure("phenoData", package = "Biobase"), package = "Biobase", group = list(), valueClass = character(0), signature = "object", default = `\\001NULL\\001`, skeleton = (function (object) 
## 3 stop("invalid call in method dispatch to 'phenoData' (no default method)",                                                                                                                                                           
## 4     domain = NA))(object), class = structure("standardGeneric", package = "methods"))
#adding gender to the phenotype data
pheno$gender <- c('m', 'f', 'm', 'f', 'm', 'f' );

#updating the expression set
pData(expressionSet) <- pheno
head(pData(expressionSet))
##                    group gender
## GSM337204.CEL.gz healthy      m
## GSM337205.CEL.gz healthy      f
## GSM337206.CEL.gz healthy      m
## GSM337367.CEL.gz  lesion      f
## GSM337368.CEL.gz  lesion      m
## GSM337369.CEL.gz  lesion      f

Annotation files

Bioconductor provides annotations for most microarray platoform. These annotations are installed as a normal package, and provide gene names for each probesets.

#loading the annotation. Notice the ".db" at the end of the name
library(hgu133plus2.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
## 
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
## 
##     colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: org.Hs.eg.db
## Warning: vfs customization not available on this platform. Ignoring value:
## vfs = unix-none

## Warning: vfs customization not available on this platform. Ignoring value:
## vfs = unix-none
## 
## Warning: vfs customization not available on this platform. Ignoring value:
## vfs = unix-none

## Warning: vfs customization not available on this platform. Ignoring value:
## vfs = unix-none
## 
#extracting gene symbols for each probeset
genesSymbols <- as.list(hgu133plus2SYMBOL);
probesetsNames <- names(genesSymbols);
genesSymbols <- as.character(genesSymbols)
names(genesSymbols) <- probesetsNames
head(genesSymbols)
## 1007_s_at   1053_at    117_at    121_at 1255_g_at   1294_at 
##      "NA"    "RFC2"   "HSPA6"    "PAX8"  "GUCA1A"      "NA"

Note regarding the types of files

  1. .CEL files: each CEL file contains a binary representation of the fluorescent levels measured on a single microarray
  2. cdf files: these files describe the association between probes and probesets fora give platform
  3. .db files: annotation files containing several gene identifiers and other information for each probeset.