This dataset was produced in a recent publication (https://www.nature.com/articles/srep19413) and it compares plasma extracellular vesicles RNA-seq levels across cancer patients. For the following analysis we selected only subject with colorectal cancer and healthy controls.
#loading the dataset
load('SRP061240.RData')
#count matrix
dim(counts)
## [1] 58037 300
counts[1:5, 1:5]
## SRR2105277 SRR2105278 SRR2105275 SRR2105276 SRR2105273
## ENSG00000000003.14 0 0 0 0 0
## ENSG00000000005.5 0 0 0 0 0
## ENSG00000000419.12 0 0 0 42 0
## ENSG00000000457.13 0 25 15 0 0
## ENSG00000000460.16 0 25 0 0 0
#phenotype data
dim(sample_info)
## [1] 300 3
head(sample_info)
## disease.status sex age
## 1 Cancer Male 42
## 2 Cancer Male 42
## 3 Cancer Male 54
## 4 Cancer Male 54
## 5 Cancer Male 29
## 6 Cancer Male 29
#re_encoding phenotype data
sample_info$age <- as.numeric(sample_info$age)
sample_info$disease.status <- as.numeric(sample_info$disease.status == 'Cancer');
sample_info$sex <- as.numeric(sample_info$sex == 'Male');
Both packages belongs to the Bioconductor repository. Thus, the commands for installing them are identical:
#installation
if(!require('DESeq2', quietly = TRUE)){
source("https://bioconductor.org/biocLite.R")
biocLite("DESeq2")
library('DESeq2', quietly = TRUE)
}
##
## Attaching package: 'BiocGenerics'
## The following objects are masked from 'package:parallel':
##
## clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
## clusterExport, clusterMap, parApply, parCapply, parLapply,
## parLapplyLB, parRapply, parSapply, parSapplyLB
## The following objects are masked from 'package:stats':
##
## IQR, mad, xtabs
## The following objects are masked from 'package:base':
##
## anyDuplicated, append, as.data.frame, cbind, colnames,
## do.call, duplicated, eval, evalq, Filter, Find, get, grep,
## grepl, intersect, is.unsorted, lapply, lengths, Map, mapply,
## match, mget, order, paste, pmax, pmax.int, pmin, pmin.int,
## Position, rank, rbind, Reduce, rownames, sapply, setdiff,
## sort, table, tapply, union, unique, unsplit, which, which.max,
## which.min
##
## Attaching package: 'S4Vectors'
## The following objects are masked from 'package:base':
##
## colMeans, colSums, expand.grid, rowMeans, rowSums
## Welcome to Bioconductor
##
## Vignettes contain introductory material; view with
## 'browseVignettes()'. To cite Bioconductor, see
## 'citation("Biobase")', and for packages 'citation("pkgname")'.
if(!require('edgeR', quietly = TRUE)){
source("https://bioconductor.org/biocLite.R")
biocLite("edgeR")
library('edgeR', quietly = TRUE)
}
##
## Attaching package: 'limma'
## The following object is masked from 'package:DESeq2':
##
## plotMA
## The following object is masked from 'package:BiocGenerics':
##
## plotMA
A simple histogram of the counts shows that most of the values are zero:
#values distribution
hist(counts, 50)
This issue comes from the fact that we are analyzing extracellular vescicles, not cells. This means that we can expect most of the genes not to be expressed at all. On the other side, we may not want to analyze genes whose values are constantly flat. Thus we are going to eliminate all genes who do not have on average at least 50 counts for each sample:
#keeping only relevant genes
toKeep <- apply(counts, 1, sum) > 50 * dim(counts)[2];
counts <- counts[toKeep, ];
dim(counts)
## [1] 1014 300
#histogram of the values
hist(counts[counts <= 200], 50)
Similarly to other packages, edgeR requires the data to be encapsulated in a specific object before being processed. Moreover, the disperion parameter \(\alpha_i\) must be explicitly estimated.
#object for edgeR
edgeRObject <- DGEList(counts = counts)
#design matrix
designMatrix <- model.matrix(~ age + sex + disease.status + disease.status:sex, sample_info)
#estimating dispersion
edgeRObject <- estimateDisp(y = edgeRObject, design = designMatrix)
Note that we included also the interaction disease.status:sex into the analysis. This will allow us to investigate if any gene is differentially expressed across healthy controls and subject in one gender but not in the other.
We now compute the generalized linear models corresponding to our design, and then we use it for deriving genes that are simply differentially expressed.
#fitting the models
edgeRFit <- glmFit(edgeRObject, designMatrix)
#testing for disease.status
diseaseGenes <- glmLRT(edgeRFit, coef = 'disease.status')
#top differentially expressed genes
diseaseGenes <- topTags(diseaseGenes, n = dim(counts)[1], p.value = 1)@.Data[[1]]
head(diseaseGenes)
## logFC logCPM LR PValue FDR
## ENSG00000222726.1 1.9491375 6.226780 41.24853 1.340528e-10 8.906471e-08
## ENSG00000222276.1 1.7316262 6.626148 40.72011 1.756700e-10 8.906471e-08
## ENSG00000222626.1 1.7315687 4.241677 28.77293 8.138053e-08 2.750662e-05
## ENSG00000222293.1 0.9002397 5.044431 20.67008 5.456189e-06 1.383144e-03
## ENSG00000207568.1 0.9378403 3.745587 19.34989 1.088255e-05 1.839151e-03
## ENSG00000283165.1 0.9378403 3.745587 19.34989 1.088255e-05 1.839151e-03
Alternatively, we can also check how many genes are differentially expressed taking into account the interaction.
#testing for disease.status and disease.status:sex
diseaseSexGenes <- glmLRT(edgeRFit, coef = c('disease.status', 'sex:disease.status'))
#top differentially expressed genes
diseaseSexGenes <- topTags(diseaseSexGenes, n = dim(counts)[1], p.value = 1)@.Data[[1]]
head(diseaseSexGenes)
## logFC.disease.status logFC.sex.disease.status logCPM
## ENSG00000222726.1 1.9491375 -0.9818470 6.226780
## ENSG00000222276.1 1.7316262 -0.8553260 6.626148
## ENSG00000207021.1 -0.8979103 2.9002475 4.599352
## ENSG00000131126.18 1.0692768 0.1989669 3.697346
## ENSG00000222626.1 1.7315687 -1.1213408 4.241677
## ENSG00000251862.2 -0.3359398 -0.1163393 5.056793
## LR PValue FDR
## ENSG00000222726.1 51.74285 5.810102e-12 3.152159e-09
## ENSG00000222276.1 51.60738 6.217277e-12 3.152159e-09
## ENSG00000207021.1 42.32073 6.459068e-10 2.183165e-07
## ENSG00000131126.18 37.38049 7.637140e-09 1.936015e-06
## ENSG00000222626.1 32.80605 7.520689e-08 1.525196e-05
## ENSG00000251862.2 26.74387 1.558266e-06 2.633469e-04
Let’s check the overlap among differentially expressed genes
#threshold for fdr
threshold <- 0.1;
#genes DE for disease.status
diseaseGenesDE <- rownames(diseaseGenes)[diseaseGenes$FDR <= threshold]
length(diseaseGenesDE)
## [1] 50
#genes DE for disease.status and disease.status:sex
diseaseSexGenesDE <- rownames(diseaseSexGenes)[diseaseSexGenes$FDR <= threshold]
length(diseaseSexGenesDE)
## [1] 91
#overlap
length(intersect(diseaseGenesDE, diseaseSexGenesDE)) / length(union(diseaseGenesDE, diseaseSexGenesDE))
## [1] 0.4842105
This indicates that gender has a large effect on gene deregulation across cancer patients and healthy controls.
DESeq2 is quite similar to edgeR in terms of usability. The two toos are know to provide slightly different results, though. Repeat the analysis with DESeq2 and check the overlap with the edgeR findings.