Bioconductor can be easily installed with the following code:
source("https://bioconductor.org/biocLite.R")
biocLite()
The first line downloads a scrip containing the function biocLite, while the latter will download and install the core set of bioconductor packages.
A CEL file contains a binary representation of the fluorescent levels of a single microarray. In this lesson we will use 6 CEL files from the following study:
Nair RP et al., Genome-wide scan reveals association of psoriasis with IL-23 and NF-kappaB pathways. Nat Genet. 2009 Feb;41(2):199-204. doi: 10.1038/ng.311
Three CEL files were profiled from biopsies taken in lesional skin, while the other ones from biopsies taken in healthy tissue. The easiest way for loading CEL files in R is through the **affy* (standing for “affymetrix”) package:
#loading the library
library(affy, quietly = TRUE)
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit, which, which.max,
## which.min
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
#files names
fileList <- dir('CEL/', '.CEL', full.names = FALSE)
#sample information
pheno <- data.frame(group =c(rep('healthy', 3), rep('lesion', 3)),
stringsAsFactors = FALSE);
rownames(pheno) <- fileList
#reading the files
affyBatch <- ReadAffy(celfile.path = 'CEL', phenoData = pheno);
The affyBatch object now contains two types of information:
Notice that the latter may contain more than one column, if needed.
We can now apply the RMA algorithm on the samples:
#RMA
expressionSet <- rma(affyBatch, normalize = TRUE, background = TRUE);
## Warning: replacing previous import 'AnnotationDbi::tail' by 'utils::tail'
## when loading 'hgu133plus2cdf'
## Warning: replacing previous import 'AnnotationDbi::head' by 'utils::head'
## when loading 'hgu133plus2cdf'
##
## Background correcting
## Normalizing
## Calculating Expression
The expressionSet object is similar to affyBatch, but it contains the processed data instead of the raw ones. Intestingly, expressionSets behave much as data frames / matrices:
#expressionSet dimensions
dim(expressionSet)
## Features Samples
## 54675 6
#columns and rows names
colnames(expressionSet)
## [1] "GSM337204.CEL.gz" "GSM337205.CEL.gz" "GSM337206.CEL.gz"
## [4] "GSM337367.CEL.gz" "GSM337368.CEL.gz" "GSM337369.CEL.gz"
head(rownames(expressionSet))
## [1] "1007_s_at" "1053_at" "117_at" "121_at" "1255_g_at" "1294_at"
#subsetting
dim(expressionSet[1:100, 1:4])
## Features Samples
## 100 4
Moreover, is it possible to extract and update both the expression and the phenotype data:
#expression data
dataMatrix <- exprs(expressionSet)
dataMatrix[1:5, 1:5]
## GSM337204.CEL.gz GSM337205.CEL.gz GSM337206.CEL.gz
## 1007_s_at 11.038991 10.439639 10.740932
## 1053_at 7.548828 7.882609 7.701032
## 117_at 6.960496 7.608002 7.328793
## 121_at 8.203228 7.936755 8.261642
## 1255_g_at 3.763614 3.914914 4.150167
## GSM337367.CEL.gz GSM337368.CEL.gz
## 1007_s_at 10.057551 10.504319
## 1053_at 8.432856 7.906931
## 117_at 7.335760 7.140861
## 121_at 7.521465 7.699498
## 1255_g_at 3.849545 3.994785
#phenotype data
pheno <- pData(expressionSet)
head(phenoData)
##
## 1 structure(function (object)
## 2 standardGeneric("phenoData"), generic = structure("phenoData", package = "Biobase"), package = "Biobase", group = list(), valueClass = character(0), signature = "object", default = `\\001NULL\\001`, skeleton = (function (object)
## 3 stop("invalid call in method dispatch to 'phenoData' (no default method)",
## 4 domain = NA))(object), class = structure("standardGeneric", package = "methods"))
#adding gender to the phenotype data
pheno$gender <- c('m', 'f', 'm', 'f', 'm', 'f' );
#updating the expression set
pData(expressionSet) <- pheno
head(pData(expressionSet))
## group gender
## GSM337204.CEL.gz healthy m
## GSM337205.CEL.gz healthy f
## GSM337206.CEL.gz healthy m
## GSM337367.CEL.gz lesion f
## GSM337368.CEL.gz lesion m
## GSM337369.CEL.gz lesion f
Bioconductor provides annotations for most microarray platoform. These annotations are installed as a normal package, and provide gene names for each probesets.
#loading the annotation. Notice the ".db" at the end of the name
library(hgu133plus2.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: org.Hs.eg.db
## Warning: vfs customization not available on this platform. Ignoring value:
## vfs = unix-none
## Warning: vfs customization not available on this platform. Ignoring value:
## vfs = unix-none
##
## Warning: vfs customization not available on this platform. Ignoring value:
## vfs = unix-none
## Warning: vfs customization not available on this platform. Ignoring value:
## vfs = unix-none
##
#extracting gene symbols for each probeset
genesSymbols <- as.list(hgu133plus2SYMBOL);
probesetsNames <- names(genesSymbols);
genesSymbols <- as.character(genesSymbols)
names(genesSymbols) <- probesetsNames
head(genesSymbols)
## 1007_s_at 1053_at 117_at 121_at 1255_g_at 1294_at
## "NA" "RFC2" "HSPA6" "PAX8" "GUCA1A" "NA"