Bioconductor contains some quite effective packages for performing functional analysis. We will see the clusterProfiler package, which provides a set of automatic functions for enrichment analysis:
#installing GEOQuery
if(!require('clusterProfiler', quietly = TRUE)){
#note: use 'http' if 'https' is not supported
source(file = "http://bioconductor.org/biocLite.R")
biocLite("clusterProfiler")
library("clusterProfiler")
}
In order to investigate clusterProfiler functions, we will use the results obtained by analyzign the data of the GEO dataset GDS6083. Particularly, we will identify pathways that are enriched for gene differentially expressed between 4 and 20 hours after treatment in Peripheral Blood Mononuclear Cells (PBMC) of chronic lymphocytic leukemia patients.
#loading the results
load('GDS6083_timeDiffRes.RData')
#inspecting the results
head(res)
## logFC AveExpr t P.Value adj.P.Val
## 228097_at 2.633062 6.375286 31.29831 3.102956e-12 1.313264e-07
## 235010_at -2.262766 7.068141 -29.30360 6.421864e-12 1.358963e-07
## 206683_at 2.411463 8.724179 25.41204 3.085745e-11 3.379270e-07
## 221841_s_at 3.847159 6.510832 25.24899 3.312000e-11 3.379270e-07
## 202283_at -1.828119 7.575515 -24.82340 3.992239e-11 3.379270e-07
## 207630_s_at 2.216477 9.432368 23.99936 5.783475e-11 4.079567e-07
## B
## 228097_at 17.80871
## 235010_at 17.25342
## 206683_at 15.98543
## 221841_s_at 15.92617
## 202283_at 15.76892
## 207630_s_at 15.45347
We now need to associate the Entrez gene number to each probeset:
#loading the bioconductor annotation package
library(hgu133plus2.db)
#entrez gene
entrezIds <- as.list(hgu133plus2ENTREZID)
tmp <- names(entrezIds)
entrezIds <- as.character(entrezIds)
names(entrezIds) <- tmp
entrezIds <- entrezIds[rownames(res)]
We now proceed in computing the GO categories that are enriched according to the hypergeometric test.
#significance threshold
threshold <- 0.05;
#selecting the relevant genes
signGenes <- entrezIds[res$adj.P.Val <= threshold] # why using the adj. p-values?
ego <- enrichGO(gene = signGenes,
universe = entrezIds,
OrgDb = org.Hs.eg.db,
ont = "BP",
pAdjustMethod = "fdr",
pvalueCutoff = 0.05,
qvalueCutoff = 0.05,
minGSSize = 15,
maxGSSize = 500,
readable = TRUE)
head(ego)[, 1:7]
## ID
## GO:0044772 GO:0044772
## GO:0044770 GO:0044770
## GO:0043161 GO:0043161
## GO:0006281 GO:0006281
## GO:0006914 GO:0006914
## GO:0010498 GO:0010498
## Description
## GO:0044772 mitotic cell cycle phase transition
## GO:0044770 cell cycle phase transition
## GO:0043161 proteasome-mediated ubiquitin-dependent protein catabolic process
## GO:0006281 DNA repair
## GO:0006914 autophagy
## GO:0010498 proteasomal protein catabolic process
## GeneRatio BgRatio pvalue p.adjust qvalue
## GO:0044772 259/6164 440/15024 1.581758e-14 7.247616e-11 6.165527e-11
## GO:0044770 270/6164 466/15024 6.070250e-14 1.390694e-10 1.183060e-10
## GO:0043161 214/6164 358/15024 3.751281e-13 5.729456e-10 4.874032e-10
## GO:0006281 271/6164 477/15024 1.242233e-12 1.422978e-09 1.210524e-09
## GO:0006914 237/6164 414/15024 1.164354e-11 1.067014e-08 9.077055e-09
## GO:0010498 223/6164 386/15024 1.424629e-11 1.087942e-08 9.255092e-09
There are a total of 468The first results is the mitotic cell cycle phase transition (GO:0044772). This catergory has 259 genes in the 6164 (~4.2%) significant ones, out of 440 total genes in the 15024 measured ones (~2,92%. This means that the odds ratio is (259/6164) / (440/15024) = 1.4347295
We may want to keep only the level 4 GO enriched categories. We can do that with the function gofilter:
#restricting the results to level 4
filteredEgo <- gofilter(ego, level = 4);
#number of results
dim(filteredEgo)
## [1] 60 9
#first results
head(filteredEgo)[ , 1:7]
## ID Description GeneRatio
## GO:0044770 GO:0044770 cell cycle phase transition 270/6164
## GO:0051186 GO:0051186 cofactor metabolic process 202/6164
## GO:0043414 GO:0043414 macromolecule methylation 153/6164
## GO:0000075 GO:0000075 cell cycle checkpoint 131/6164
## GO:0022613 GO:0022613 ribonucleoprotein complex biogenesis 229/6164
## GO:0016236 GO:0016236 macroautophagy 135/6164
## BgRatio pvalue p.adjust qvalue
## GO:0044770 466/15024 6.070250e-14 1.390694e-10 1.183060e-10
## GO:0051186 353/15024 4.066920e-10 1.694057e-07 1.441130e-07
## GO:0043414 259/15024 2.754510e-09 7.888227e-07 6.710493e-07
## GO:0000075 216/15024 3.844910e-09 9.787431e-07 8.326141e-07
## GO:0022613 420/15024 1.077795e-08 1.975383e-06 1.680453e-06
## GO:0016236 227/15024 1.300817e-08 2.128694e-06 1.810874e-06
The next step is of courtse visualizing the results. Let’s first see the dotplot function:
#plotting adj p-values and gene ratio
dotplot(filteredEgo)
We will now perform a gene set enrichment analysis on KEGG pathways.
#the input must be a descending named vector of statistics
geneList <- abs(res$t)
names(geneList) <- entrezIds;
geneList <- sort(geneList, decreasing = TRUE)
#enrichment analysis
keggGSEA <- gseKEGG(geneList = geneList,
organism = 'hsa',
nPerm = 10000,
minGSSize = 15,
maxGSSize = 500,
pAdjustMethod = 'fdr',
pvalueCutoff = 0.05,
verbose = FALSE)
head(keggGSEA)[ , 1:8]
## ID Description setSize
## hsa00190 hsa00190 Oxidative phosphorylation 120
## hsa00280 hsa00280 Valine, leucine and isoleucine degradation 48
## hsa00520 hsa00520 Amino sugar and nucleotide sugar metabolism 45
## hsa01200 hsa01200 Carbon metabolism 111
## hsa03013 hsa03013 RNA transport 144
## hsa03018 hsa03018 RNA degradation 75
## enrichmentScore NES pvalue p.adjust qvalues
## hsa00190 0.6174200 1.423097 9.999e-05 0.000816585 0.000380079
## hsa00280 0.6842336 1.518650 9.999e-05 0.000816585 0.000380079
## hsa00520 0.6787282 1.501501 9.999e-05 0.000816585 0.000380079
## hsa01200 0.6105387 1.403699 9.999e-05 0.000816585 0.000380079
## hsa03013 0.5982056 1.386627 9.999e-05 0.000816585 0.000380079
## hsa03018 0.6416123 1.453659 9.999e-05 0.000816585 0.000380079
Let’s try to have only the upregulated and downregulated pathways:
#upregulated: the input must be the vector of statistics without taking the absolute value
geneList <- res$t
names(geneList) <- entrezIds;
geneList <- sort(geneList, decreasing = TRUE)
#enrichment analysis
keggGSEA_up <- gseKEGG(geneList = geneList,
organism = 'hsa',
nPerm = 10000,
minGSSize = 15,
maxGSSize = 500,
pAdjustMethod = 'fdr',
pvalueCutoff = 0.05,
verbose = FALSE)
head(keggGSEA_up)[ , 1:8]
## ID Description setSize
## hsa04146 hsa04146 Peroxisome 81
## hsa00600 hsa00600 Sphingolipid metabolism 41
## hsa05219 hsa05219 Bladder cancer 41
## hsa04340 hsa04340 Hedgehog signaling pathway 45
## hsa05130 hsa05130 Pathogenic Escherichia coli infection 50
## hsa05213 hsa05213 Endometrial cancer 50
## enrichmentScore NES pvalue p.adjust qvalues
## hsa04146 -0.5255274 -1.751375 0.0001758706 0.0007178015 0.0002621402
## hsa00600 0.6274208 1.922880 0.0002213859 0.0007178015 0.0002621402
## hsa05219 0.6479824 1.985896 0.0002213859 0.0007178015 0.0002621402
## hsa04340 0.6668571 2.078976 0.0002226676 0.0007178015 0.0002621402
## hsa05130 0.6795976 2.164055 0.0002233639 0.0007178015 0.0002621402
## hsa05213 0.6435807 2.049366 0.0002233639 0.0007178015 0.0002621402
#downregulated: the input must be -1 * the vector of statistics
geneList <- -1 * res$t
names(geneList) <- entrezIds;
geneList <- sort(geneList, decreasing = TRUE)
#enrichment analysis
keggGSEA_down <- gseKEGG(geneList = geneList,
organism = 'hsa',
nPerm = 10000,
minGSSize = 15,
maxGSSize = 500,
pAdjustMethod = 'fdr',
pvalueCutoff = 0.05,
verbose = FALSE)
head(keggGSEA_down)[ , 1:8]
## ID Description setSize
## hsa05016 hsa05016 Huntington's disease 180
## hsa05010 hsa05010 Alzheimer's disease 161
## hsa00230 hsa00230 Purine metabolism 164
## hsa05152 hsa05152 Tuberculosis 169
## hsa04723 hsa04723 Retrograde endocannabinoid signaling 138
## hsa05012 hsa05012 Parkinson's disease 126
## enrichmentScore NES pvalue p.adjust qvalues
## hsa05016 0.5854221 2.176475 0.0001675884 0.00160704 0.001093225
## hsa05010 0.5971317 2.194295 0.0001690331 0.00160704 0.001093225
## hsa00230 0.5397284 1.987010 0.0001691189 0.00160704 0.001093225
## hsa05152 0.4498151 1.659525 0.0001694341 0.00160704 0.001093225
## hsa04723 0.5526556 1.988318 0.0001719986 0.00160704 0.001093225
## hsa05012 0.6543800 2.332172 0.0001729505 0.00160704 0.001093225
It is interesting to visualize the enrichment of single pathways, for example the most significant one, Oxidative phosphorylation (hsa00190):
gseaplot(keggGSEA, geneSetID = "hsa00190")
Moreover, we can also explore the association among enriched KEGG pathways with the enrichMap function:
#visualization
png(filename = 'enrichMap_keggGSEA.png', width = 3600, height = 4800, res = 300)
enrichMap(keggGSEA)
dev.off()
## png
## 2
The clusterProfiler package hides a number of details from the user; each time you use one of its function, R automatically downloads the definition of the required gene sets (KEGG, GO, Reactome). We will now see how to perform some of this actions by ourselves.
First, we will download the information beloning to the Oxidative phosphorylation (hsa00190) pathway from the KEGG database:
#libray
library(KEGGREST)
#conversion from kegg id to entrez
conversion <- keggConv("ncbi-geneid", "hsa")
#retrieving genes for hsa00190
tmp <- keggLink(target = 'genes', source = 'hsa00190')
#converting genes
tmp <- conversion[tmp];
hsa00190 <- sub('ncbi-geneid:', '', tmp)
Let’s now use the geneSetTest function from the limma package for performing a comparison between distribuions between the background statistics and the hsa00190 genes. Particularly, geneSetTest will check if, on average, genes belonging to hsa00190 have a higher rank than background genes.
#limma library
library(limma)
#extracting the background genes
geneList <- abs(res$t)
names(geneList) <- entrezIds;
#ensuring that only measured genes are in hsa00190
hsa00190 <- intersect(hsa00190, names(geneList))
#applying the test
gstRes <- geneSetTest(hsa00190, geneList)
gstRes
## [1] 6.367928e-15
As we can see, the test only returns a p-value. All the part about visualizing and organizing the resuls is finally left to the user