Linear models are often employed for identifying deregulated quantities in omics studies. Particularly, most of the R packages used for identifying differentially expressed genes use linear models, both in microarray and RNA-seq data (e.g., Limma, edgeR, etc.).
On the other hand, gene expression is nowadays measured with high-throughput technologies that can provide tens or even hundreds of thousands of different measurements. When results from analyzing these quantities in isolation are considered together, correcting for multiple hypothesis is mandatory.
We will not see a short guide about how to use linear models for identifying deregulated quantities in presence of covariates and how to adjust the resulting p-values for controlling the number of false discoveries.
We will start by simulating some gene expression data, along with an outcome of interest (treatment) and a few covariates (age, gender).
#study design matrix
nSamples <- 12;
phenoMatrix <- data.frame(treatment = c(rep(1, nSamples/2), rep(0, nSamples/2)),
age = 25 * (runif(nSamples) + 1),
gender = rep(c(1, 0), nSamples/2));
phenoMatrix
## treatment age gender
## 1 1 43.02260 1
## 2 1 46.89433 0
## 3 1 44.02456 1
## 4 1 47.15311 0
## 5 1 36.41202 1
## 6 1 29.15929 0
## 7 0 33.12738 1
## 8 0 37.73061 0
## 9 0 43.19263 1
## 10 0 49.74342 0
## 11 0 25.86339 1
## 12 0 28.80934 0
#simulating the expression data
nGenes <- 10000;
dataMatrix <- matrix(4 * rnorm(nGenes * nSamples) + 5, nGenes, nSamples);
#altering the expression data so that some of the genes are differentially expressed
nDifferentiallyExpressed <- 100;
deGenes <- sample(1:nGenes, nDifferentiallyExpressed, replace = FALSE);
dataMatrix[deGenes, phenoMatrix$treatment == 1] <- dataMatrix[deGenes, phenoMatrix$treatment == 1] + 10;
We will now fit different linear models using limma. Similar computations can be done with edgeR or DESeq for RNA-seq data.
The first model $ y_i = {i0} + {i1} treatment + {i2} age + {i3} gender$ simply correlates the expression value $ y_i $ of gene $ i $ with the covariates present in the study design.
#calling the package
library(limma);
#creating the model matrix
modelMatrix <- model.matrix(~ treatment + age + gender, data = phenoMatrix);
#fitting the linear model and applying limma's Bayesian correction
lms <- lmFit(dataMatrix, modelMatrix);
lms <- eBayes(lms);
The object lms now contains the one linear model for each gene fitted according to the model matrix. Let’s now check how many linear models have the $ _{i1} $ coefficient for treatment significantly different from zero.
#checking the significance of treatment
treatmentRes <- topTable(lms, coef = 'treatment', number = dim(dataMatrix)[1], p.value = 1);
head(treatmentRes)
## logFC AveExpr t P.Value adj.P.Val B
## 9120 16.70855 12.924998 6.879084 6.067947e-12 6.067947e-08 12.035963
## 9578 14.76166 9.786324 6.077526 1.225999e-09 6.129996e-06 8.242102
## 6216 14.57124 9.790322 5.999128 1.992280e-09 6.640933e-06 7.896238
## 749 14.08277 9.951378 5.798021 6.735326e-09 1.683832e-05 7.029555
## 1163 13.72271 8.189112 5.649781 1.611966e-08 3.185315e-05 6.409622
## 4053 13.61623 12.626339 5.605943 2.078048e-08 3.185315e-05 6.229370
#counting how many times the coefficient for treatment is different from zero (i.e., significant)
sum(treatmentRes$P.Value <= 0.05)
## [1] 608
One first observation is that the number of coefficient with raw p-value less than 0.05 (608) is higher than the number of differentially expressed genes (100). We will come back on this in the next section.
The key point now is tha the object lms can be used for testing other coefficients as well, for example age:
#checking the significance of age
ageRes <- topTable(lms, coef = 'age', number = dim(dataMatrix)[1], p.value = 1);
head(ageRes)
## logFC AveExpr t P.Value adj.P.Val B
## 5760 -0.5761018 5.302806 -3.629846 0.0002837669 0.6833363 -1.779571
## 5537 -0.5758233 4.027726 -3.628091 0.0002857017 0.6833363 -1.782606
## 2308 -0.5706478 4.804489 -3.595482 0.0003239867 0.6833363 -1.838728
## 1596 -0.5705581 6.413598 -3.594917 0.0003246901 0.6833363 -1.839696
## 4312 0.5632640 5.049171 3.548959 0.0003869787 0.6833363 -1.917918
## 6203 -0.5529993 3.235126 -3.484284 0.0004937200 0.6833363 -2.026294
The names of the results table can be quite confusing; however, the logFC column contains the coefficient $ _{i2} $, characterizing the association between age and genes expression (positive value: the two quantities are positively correlated; vice versa for negative values). Significance tests for gender can be obtained similarly.
One advantage of the limma approach is that more complicate models can be easily fit as well. Let’s take the model $ y_i = {i0} + {i1} treatment + {i2} age + {i3} gender + _{i4} age treatment$, where we add the interaction term between treatment and age. This model ca be fit on the data as follows:
#creating the model matrix
modelMatrix <- model.matrix(~ treatment*age + gender, data = phenoMatrix);
#fitting the linear model and applying limma's Bayesian correction
lms <- lmFit(dataMatrix, modelMatrix);
lms <- eBayes(lms);
The lms object can be used for answering more complex questions, as for example whether there is any gene that is differential expressed for treatment or that has a different response to the treatment depending on age:
#checking the significance of treatment and differences in the effect of treatment according to age
treatmentAgeRes <- topTable(lms, coef = c('treatment', 'treatment:age'), number = dim(dataMatrix)[1], p.value = 1);
head(treatmentAgeRes)
## treatment treatment.age AveExpr F P.Value
## 9120 -0.4153786 0.4358963 12.924998 24.47866 2.339114e-11
## 9578 0.1942235 0.3708196 9.786324 19.05579 5.298781e-09
## 6216 -3.7778028 0.4670819 9.790322 18.96235 5.817790e-09
## 749 6.8552570 0.1839791 9.951378 16.91493 4.507538e-08
## 8860 -24.6265547 0.9298252 11.391539 16.02341 1.099310e-07
## 1163 9.5191225 0.1070039 8.189112 15.96396 1.166647e-07
## adj.P.Val
## 9120 2.339114e-07
## 9578 1.939263e-05
## 6216 1.939263e-05
## 749 1.126885e-04
## 8860 1.897784e-04
## 1163 1.897784e-04
Since we did not include any difference in the effect of the treatment for incresing age, we do not expect to find more significant p-values in testing both treatment and treatment:age than in testing only treatment. In fact, a similar number of genes has pvalues <= 0.05 in both treatmentRes (608) and treatmentAgeRes (588, the small decrease is due to a loss in statistical power caused by including the interaction term).
Due to the definition of p-value itself, testing multiple times the same hypothesis on different data can lead to obtain low, significant p-value even if the null hypothesis under consideration is true. This type of error leads to false discoveries, i.e., rejection of the null hypothesis when the latter is true. Let’s see how this affect the results on the identification of genes differentially expressed across treatments:
#identifying gene differentially expressed across treatments according to raw p-values
treatmentRes <- treatmentRes[order(as.numeric(rownames(treatmentRes))), ]
findings <- as.numeric(treatmentRes$P.Value <= 0.05)
#tabularizing info
tmp <- deGenes;
deGenes <-vector('numeric', nGenes);
deGenes[tmp] <- 1;
tableRes <- table(deGenes, findings)
tableRes
## findings
## deGenes 0 1
## 0 9391 509
## 1 1 99
This tables shows that only 99 differentially expressed genes were correctly identified, while additional 509 were incorrectly deemed significant (false discoveries). We aim at reducing the number of false discoveries, and we will see two methods for achieving this goal: Bonferroni and Benjamini-Hochberg.
The Bonferroni method simply multiply each each p-value for the number of performed tests. In this way, the probability of having even a single false discovery is lower than the significance threshold used (e.g., 0.05). In our case:
#identifying gene differentially expressed across treatments according to bonferroni
bonferroniPvalues <- p.adjust(treatmentRes$P.Value, method = 'bonferroni');
bonferroniFindings <- as.numeric(bonferroniPvalues <= 0.05)
#tabularizing info
table(deGenes, bonferroniFindings)
## bonferroniFindings
## deGenes 0 1
## 0 9900 0
## 1 74 26
We now see that we have no false discoveries. However, the number of true discovery also diminished in an alarming way: this is always the case, since true and false discovery are in antagonistic terms.
The Benjamini-Hochberg method uses a more complicate formula for correcting the p-values, which ultimately leads to keep the ratio false discoveries / total number of discoveries below the used significance threshold. For this reason, this method is also named False Discovery Rate (FDR) control.
#identifying gene differentially expressed across treatments according FDR control
bhPvalues <- p.adjust(treatmentRes$P.Value, method = 'fdr');
bhFindings <- as.numeric(bhPvalues <= 0.05)
#tabularizing info
tableRes <- table(deGenes, bhFindings)
tableRes
## bhFindings
## deGenes 0 1
## 0 9898 2
## 1 34 66
With the FDR correction we see that the number of true discoveries has sensibly increased with respect to the Bonferroni correction, while the ratio between false discoveries and total number of discoveries (2 / (2 + 66) = 0.029) is considerably lower than the significance threshold used (0.05).