In this lesson we will analyze microarray gene-expression data presented in
Opel D et al., Targeting inhibitor of apoptosis proteins by Smac mimetic elicits cell death in poor prognostic subgroups of chronic lymphocytic leukemia. Int J Cancer. 2015 Dec 15;137(12):2959-70
Peripheral Blood Mononuclear Cells (PBMC) from 12 B CLL patients were treated with Bv6 + DMSO or DMSO alone (controls). The treatment lasted either 4 or 20 hours. Gene expression profiling was performed with Affymetrix Human Chip 133 Plus 2.0. Patients had different genetics background.
Our goal is to identify genes that behave differently between treatments.
The following image represent the study design:
We can use the standard affy libraries and bioconductor annotation packages (gu133plus2cdf and gu133plus2.db) for loading, preprocessing and annotate our data:
#library for affymetrix preprocessing
library(affy);
## Loading required package: BiocGenerics
## Loading required package: parallel
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following object is masked from 'package:RevoScaleR':
##
## colnames
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, as.vector, cbind,
## colnames, do.call, duplicated, eval, evalq, Filter, Find, get,
## grep, grepl, intersect, is.unsorted, lapply, lengths, Map,
## mapply, match, mget, order, paste, pmax, pmax.int, pmin,
## pmin.int, Position, rank, rbind, Reduce, rownames, sapply,
## setdiff, sort, table, tapply, union, unique, unlist, unsplit
## Loading required package: Biobase
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
#loading the phenotype data
load('pheno.RData')
head(pheno)
## agent time individual genotype_variation
## GSM1528449.CEL.gz BV6 hour 4 patient 23 karyotype: normal
## GSM1528455.CEL.gz BV6 hour 4 patient 50 karyotype: 13q-
## GSM1528457.CEL.gz BV6 hour 4 patient 51 karyotype: 13q-, 14q-
## GSM1528447.CEL.gz BV6 hour 20 patient 23 karyotype: normal
## GSM1528451.CEL.gz BV6 hour 20 patient 44 karyotype: 13q- bidel
## GSM1528453.CEL.gz BV6 hour 20 patient 50 karyotype: 13q-
## other
## GSM1528449.CEL.gz tp53 mutation: MUT
## GSM1528455.CEL.gz tp53 mutation: WT
## GSM1528457.CEL.gz tp53 mutation: WT
## GSM1528447.CEL.gz tp53 mutation: MUT
## GSM1528451.CEL.gz tp53 mutation: WT
## GSM1528453.CEL.gz tp53 mutation: WT
#applying RMA directly on the CEL files
dataset <- justRMA(celfile.path = 'CELFiles/', phenoData = pheno)
##
#annotating probesets
library(hgu133plus2.db)
## Loading required package: AnnotationDbi
## Loading required package: stats4
## Loading required package: IRanges
## Loading required package: S4Vectors
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Loading required package: org.Hs.eg.db
##
##
genesSymbols <- as.list(hgu133plus2SYMBOL)
probesetsNames <- names(genesSymbols)
genesSymbols <- as.character(genesSymbols)
names(genesSymbols) <- probesetsNames
head(genesSymbols)
## 1007_s_at 1053_at 117_at 121_at 1255_g_at 1294_at
## "NA" "RFC2" "HSPA6" "PAX8" "GUCA1A" "NA"
Alternatively, we can use the cdf file provided by the brainarray initiative. First, let’s download and manually install hgu133plus2hsentrezg.db_20.0.0 and hgu133plus2hsentrezgcdf_20.0.0 (either zip files for windows or source files for unix) from http://brainarray.mbni.med.umich.edu/bioc/bin/windows/contrib/3.2/
Then, the following line will provide an alternative preprocessing:
#obtaining the data preprocessed with brainarray
datasetBA <- justRMA(celfile.path = 'CELFiles/',
cdfname = 'hgu133plus2hsentrezg',
phenoData = pheno)
##
#annotations
library(hgu133plus2hsentrezg.db)
##
genesSymbolsBA <- as.list(hgu133plus2hsentrezgSYMBOL)
probesetsNames <- names(genesSymbolsBA)
genesSymbolsBA <- as.character(genesSymbolsBA)
names(genesSymbolsBA) <- probesetsNames
head(genesSymbolsBA)
## 100009676_at 10000_at 10001_at 10002_at 100038246_at
## "ZBTB11-AS1" "AKT3" "MED6" "NR2E3" "TLX1NB"
## 10003_at
## "NAALAD2"
Let’s check the level of similarity between the two datasets
#number of probesets
print(dim(dataset))
## Features Samples
## 54675 12
#number of probesets brainarray:
print(dim(datasetBA))
## Features Samples
## 19425 12
#how many probesets with no annotations?
print(sum(genesSymbols == 'NA'))
## [1] 12352
#how many probesets with no annotations (brainarray)?
print(sum(genesSymbolsBA == 'NA'))
## [1] 163
#how many genes with multiple probesets?
print(sum(duplicated(genesSymbols[genesSymbols != 'NA'])))
## [1] 21809
#how many genes with multiple probesets (brainarray)?
print(sum(duplicated(genesSymbolsBA[genesSymbolsBA != 'NA'])))
## [1] 0
In the rest of the analysis we will use the dataset produced with the bioconductor cdf definition.
The first step before proceeding with any analysis is checking the quality of the data. This is valid in any data analysis task, not only for microarray data. Unreliable data lead to wrong or deceiving findings and conclusions.
Quality control in microarray is performed through several checks. The first one consists in ensuring the success of quantile normalization. This is usually performed with boxplots
#extracting the matrix of expression data as well as the phenotype data
dataMatrix <- exprs(dataset)
pheno <- pData(dataset)
#good practice: sorting both in terms of sample names
dataMatrix <- dataMatrix[ , order(colnames(dataMatrix))]
pheno <- pheno[order(rownames(pheno)), ]
#stacking profiles
toPlot <- stack(as.data.frame(dataMatrix))
head(toPlot)
## values ind
## 1 9.136059 GSM1528447.CEL.gz
## 2 7.740218 GSM1528447.CEL.gz
## 3 5.968468 GSM1528447.CEL.gz
## 4 7.833371 GSM1528447.CEL.gz
## 5 3.174524 GSM1528447.CEL.gz
## 6 7.401016 GSM1528447.CEL.gz
#boxplot
boxplot(values~ind, toPlot)
Distributions seem fairly equivalent in the boxplot, so we may rule out the possibility of batch-effects. Let’s assume for sake of an example that samples GSM1528453.CEL.gz, GSM1528454.CEL.gz, GSM1528455.CEL.gz, GSM1528456.CEL.gz have a slightly shifted distribution:
#artificially modifying the distribution of selected samples
toPlot$values[toPlot$ind %in% c('GSM1528455.CEL.gz', 'GSM1528453.CEL.gz', 'GSM1528456.CEL.gz', 'GSM1528454.CEL.gz')] <- toPlot$values[toPlot$ind %in% c('GSM1528455.CEL.gz', 'GSM1528453.CEL.gz', 'GSM1528456.CEL.gz', 'GSM1528454.CEL.gz')] + 2;
#plotting again
boxplot(values~ind, toPlot)
This time we observe a shift in the distributions. If we do not know what is the cause of this shift, we can start investigating by coloring the boxes according to the factors in the study design:
#coloring according to treatment
boxplot(values~ind, toPlot, col = ifelse(pheno$agent == 'BV6', 'blue', 'white'))
It seems unrealistic that the treatment is at the basis of this shift. Let’s try with genetic variation:
#coloring according to genotype
boxplot(values~ind, toPlot, col = ifelse(pheno$genotype_variation == 'karyotype: 13q-', 'blue', 'white'))
Now we see that the shift happens for samples with a specific genetic variation. In a real analysis the first question would be: were the samples corresponding to this particular genetic background processed in a separate day? Or in another lab? If yes, then any of these differences is likely to have created a bias, namely a batch-effect, in the analysis. Note that such a shift should not be possible if all samples are quantile normalized together.
Other useful plots for quality control are correlation plots:
#library for correlation plots
library(corrplot)
#correlation plot for the data matrix
corrs <- cor(dataMatrix)
corrplot(corrs, method = 'color')
#also, sample-pairs scatter plots
sample1 <- 'GSM1528452.CEL.gz'
sample2 <- 'GSM1528451.CEL.gz'
plot(dataMatrix[ , sample1], dataMatrix[ , sample2])
All correlations are quite elevated (min correlation: 0.9554). If a single sample were to be affected by a major failure during profiling, it would stick out as an outlier in the correlation plot.
Finally, a Principal Component Analysis (PCA) plot will help assessing whether the samples groups as expected:
#graphic library
library(ggplot2)
#PCA analysis
pc <- prcomp(t(dataMatrix));
# %var explained by each component
percVar <- pc$sdev^2 / sum(pc$sdev^2);
#plotting
toPlot <- data.frame(PC1 = pc$x[, 1], PC2 = pc$x[, 2], PC3 = pc$x[, 3], treatment = pheno$agent, exposure = pheno$time, patient = pheno$individual, genotype = pheno$genotype_variation);
g <- ggplot() +
geom_point(mapping = aes(x = PC1, y = PC2, color = treatment, shape = exposure),
data = toPlot) +
xlab(paste('PC 1, %var:', round(percVar[1], 4))) +
ylab(paste('PC 2, %var:', round(percVar[2], 4))) +
theme_bw()
plot(g)
It is evident that neither the exposure drives the first component. Let’s change color schema to investigate what drives the second component.
#plotting
g <- ggplot() +
geom_point(mapping = aes(x = PC1, y = PC2, color = patient, shape = exposure),
data = toPlot) +
xlab(paste('PC 1, %var:', round(percVar[1], 4))) +
ylab(paste('PC 2, %var:', round(percVar[2], 4))) +
theme_bw()
plot(g)
Comparing the two graphs we can derive that: 1. the largest part of variation is due to time of exposure, independently by the treatment (PC 1) 2. the expression profile is patient-specific (PC 2) 3. there are minimal differences in terms of treatment
Exercise: check different factors and further components.
The dataset is composed by 54675 probesets; time to filter some of them out! A large number of probesets will make correcting for multiple testing harder.
A first possibility is excluding probesets without a known gene identifier.
#keeping only probesets associated to genes
toKeep <- which(genesSymbols != 'NA')
dataset <- dataset[toKeep, ]
dataMatrix <- dataMatrix[toKeep, ]
genesSymbols <- genesSymbols[toKeep]
Other possible ways to reduce the number of probesets are 1. collapsing probesets measuring the same gene 2. excluding probesets with low variance
The first method can be useful whenever a single value for each gene is required. This implies either selecting a single probeset for each gene (and discarding the others), or computing an average (representative) value out of the multiple probesets. A possible drawback is that distinct probesets can have very different sensitivity, thus we may smooth down the signal carried by a single sensible probeset. The function WGCNA::collapseRows() provides several alternatives for summarizing probesets measuring the same gene.
Probesets with low variance are assumed to be unlikely to provide biological information. However, it is not easy to determine the exact threshold under which a measurement is not relevant. The package genefilter provides several methods for filtering probesets out, including function based on variance.
The analysis of the principal components of the data suggest to take into account (correct for) time exposure and patient-specific effects when searching for genes differentially expressed across treatments.
We will use linear modelling for the analysis. This means that we will model the expression of each gene as resulting from a linear combination of the factors under study:
\[ y_i = \alpha_i + \alpha_1 \cdot treatment + \alpha_2 \cdot exposure + \alpha_3 \cdot patient23 + \alpha_4 \cdot patient44 + \alpha_5 \cdot patient50 + \epsilon \]
where \(y_i\) is the expression value of gene i, \(\alpha_i\) is the gene-specific baseline expression, treatment and exposure are binary variables indicating respectively the type of drug and the length of exposure, while \(\alpha_1\) and \(\alpha_2\) are they respective coefficients. The random error \(\epsilon\) models variation not accounted for in the model, while the remaining terms model patient-specific effects.
We are interested in identify the genes where \(\alpha_1\) is statistically significant from zero, even in presence of all other factors.
We use the package limma for performing the analysis. This package is specifically conceived for linear modelling for genomic data, and offers several advantages:
For more details: > Ritchie ME et al., limma powers differential expression analyses for RNA-sequencing and microarray studies. Nucleic Acids Res. 2015 Apr 20; 43(7): e47.
#we need to specify that cells treated with BV6 are "cases"
pheno$agent <- relevel(pheno$agent, 'DMSO control')
#specifying the factors to include in the linear models
mod <- model.matrix(~ agent + time + individual, pheno)
#checking the names of the factors; we are interested in "agentBV6"
colnames(mod)
## [1] "(Intercept)" "agentBV6" "timehour 4"
## [4] "individualpatient 44" "individualpatient 50" "individualpatient 51"
#modelling with limma
library(limma)
##
## Attaching package: 'limma'
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
#linear models
models <- lmFit(dataMatrix, design = mod)
#Bayesian estimation
models <- eBayes(models)
#extracting the differential expression statistics
res <- topTable(models, coef = 'agentBV6', number = dim(dataMatrix)[1], p.value = 1);
#visualizing the first results
head(res)
## logFC AveExpr t P.Value adj.P.Val
## 206363_at -0.3567991 4.510931 -5.702499 0.0001298981 0.998733
## 205965_at 0.3669120 9.258560 5.100225 0.0003278197 0.998733
## 236471_at 0.4810106 4.935700 4.888665 0.0004593546 0.998733
## 213975_s_at -1.2434425 4.497206 -4.783694 0.0005443226 0.998733
## 202638_s_at 0.5013026 7.210091 4.751096 0.0005739593 0.998733
## 1557169_x_at -0.4742879 6.426972 -4.659043 0.0006671828 0.998733
## B
## 206363_at -1.144149
## 205965_at -1.471345
## 236471_at -1.598898
## 213975_s_at -1.664758
## 202638_s_at -1.685563
## 1557169_x_at -1.745227
On the basis of these results, it is very hard to justify the existence of any difference in the expression profiles across treatments. According to the PCA plots, the exposure time might be more influential
#extracting the differential expression statistics for exposure time
res <- topTable(models, coef = 'timehour 4', number = dim(dataMatrix)[1], p.value = 1);
#visualizing the first results
head(res)
## logFC AveExpr t P.Value adj.P.Val
## 228097_at 2.633062 6.375286 31.29831 3.102956e-12 1.313264e-07
## 235010_at -2.262766 7.068141 -29.30360 6.421864e-12 1.358963e-07
## 206683_at 2.411463 8.724179 25.41204 3.085745e-11 3.379270e-07
## 221841_s_at 3.847159 6.510832 25.24899 3.312000e-11 3.379270e-07
## 202283_at -1.828119 7.575515 -24.82340 3.992239e-11 3.379270e-07
## 207630_s_at 2.216477 9.432368 23.99936 5.783475e-11 4.079567e-07
## B
## 228097_at 17.80871
## 235010_at 17.25342
## 206683_at 15.98543
## 221841_s_at 15.92617
## 202283_at 15.76892
## 207630_s_at 15.45347
This time we obtain very low FDR corrected p-values, indicating real differences within the data. Particularly, there are 13291 probesets significant at level 0.1.
Let’s visualize these probesets using an heatmap.
#library for the annotated heatmap
library(NMF)
## Loading required package: pkgmaker
## Loading required package: registry
##
## Attaching package: 'pkgmaker'
## The following object is masked from 'package:S4Vectors':
##
## new2
## The following object is masked from 'package:base':
##
## isNamespaceLoaded
## Loading required package: rngtools
## Loading required package: cluster
## NMF - BioConductor layer [OK] | Shared memory capabilities [NO: windows] | Cores 7/8
##
## Attaching package: 'NMF'
## The following objects are masked from 'package:S4Vectors':
##
## compare, nrun
#scaling values
scaledDataMatrix <- t(scale(t(dataMatrix)))
#selecting the significant probesets
significantProbesets <- rownames(res)[res$adj.P.Val <= 0.1];
scaledDataMatrix <- scaledDataMatrix[significantProbesets, ]
#plotting
aheatmap(scaledDataMatrix, annCol = pheno, labRow = NULL)