The cluster profiler package

Bioconductor contains some quite effective packages for performing functional analysis. We will see the clusterProfiler package, which provides a set of automatic functions for enrichment analysis:

#installing GEOQuery
if(!require('clusterProfiler', quietly = TRUE)){
  #note: use 'http' if 'https' is not supported
  source(file = "http://bioconductor.org/biocLite.R")
  biocLite("clusterProfiler")
  library("clusterProfiler")
}

In order to investigate clusterProfiler functions, we will use the results obtained by analyzign the data of the GEO dataset GDS6083. Particularly, we will identify pathways that are enriched for gene differentially expressed between 4 and 20 hours after treatment in Peripheral Blood Mononuclear Cells (PBMC) of chronic lymphocytic leukemia patients.

#loading the results
load('GDS6083_timeDiffRes.RData')

#inspecting the results
head(res)
##                 logFC  AveExpr         t      P.Value    adj.P.Val
## 228097_at    2.633062 6.375286  31.29831 3.102956e-12 1.313264e-07
## 235010_at   -2.262766 7.068141 -29.30360 6.421864e-12 1.358963e-07
## 206683_at    2.411463 8.724179  25.41204 3.085745e-11 3.379270e-07
## 221841_s_at  3.847159 6.510832  25.24899 3.312000e-11 3.379270e-07
## 202283_at   -1.828119 7.575515 -24.82340 3.992239e-11 3.379270e-07
## 207630_s_at  2.216477 9.432368  23.99936 5.783475e-11 4.079567e-07
##                    B
## 228097_at   17.80871
## 235010_at   17.25342
## 206683_at   15.98543
## 221841_s_at 15.92617
## 202283_at   15.76892
## 207630_s_at 15.45347

We now need to associate the Entrez gene number to each probeset:

#loading the bioconductor annotation package
library(hgu133plus2.db)

#entrez gene
entrezIds <- as.list(hgu133plus2ENTREZID)
tmp <- names(entrezIds)
entrezIds <- as.character(entrezIds)
names(entrezIds) <- tmp
entrezIds <- entrezIds[rownames(res)]

Hypergeometric test on GO categories

We now proceed in computing the GO categories that are enriched according to the hypergeometric test.

#significance threshold
threshold <- 0.05;

#selecting the relevant genes
signGenes <- entrezIds[res$adj.P.Val <= threshold] # why using the adj. p-values?

ego <- enrichGO(gene          = signGenes,
                universe      = entrezIds,
                OrgDb         = org.Hs.eg.db,
                ont           = "BP",
                pAdjustMethod = "fdr",
                pvalueCutoff  = 0.05,
                qvalueCutoff  = 0.05,
                minGSSize     = 15,
                maxGSSize     = 500,
                readable      = TRUE)
head(ego)[, 1:7]
##                    ID
## GO:0044772 GO:0044772
## GO:0044770 GO:0044770
## GO:0043161 GO:0043161
## GO:0006281 GO:0006281
## GO:0006914 GO:0006914
## GO:0010498 GO:0010498
##                                                                  Description
## GO:0044772                               mitotic cell cycle phase transition
## GO:0044770                                       cell cycle phase transition
## GO:0043161 proteasome-mediated ubiquitin-dependent protein catabolic process
## GO:0006281                                                        DNA repair
## GO:0006914                                                         autophagy
## GO:0010498                             proteasomal protein catabolic process
##            GeneRatio   BgRatio       pvalue     p.adjust       qvalue
## GO:0044772  259/6164 440/15024 1.581758e-14 7.247616e-11 6.165527e-11
## GO:0044770  270/6164 466/15024 6.070250e-14 1.390694e-10 1.183060e-10
## GO:0043161  214/6164 358/15024 3.751281e-13 5.729456e-10 4.874032e-10
## GO:0006281  271/6164 477/15024 1.242233e-12 1.422978e-09 1.210524e-09
## GO:0006914  237/6164 414/15024 1.164354e-11 1.067014e-08 9.077055e-09
## GO:0010498  223/6164 386/15024 1.424629e-11 1.087942e-08 9.255092e-09

There are a total of 468The first results is the mitotic cell cycle phase transition (GO:0044772). This catergory has 259 genes in the 6164 (~4.2%) significant ones, out of 440 total genes in the 15024 measured ones (~2,92%. This means that the odds ratio is (259/6164) / (440/15024) = 1.4347295

We may want to keep only the level 4 GO enriched categories. We can do that with the function gofilter:

#restricting the results to level 4
filteredEgo <- gofilter(ego, level = 4);

#number of results
dim(filteredEgo)
## [1] 60  9
#first results
head(filteredEgo)[ , 1:7]
##                    ID                          Description GeneRatio
## GO:0044770 GO:0044770          cell cycle phase transition  270/6164
## GO:0051186 GO:0051186           cofactor metabolic process  202/6164
## GO:0043414 GO:0043414            macromolecule methylation  153/6164
## GO:0000075 GO:0000075                cell cycle checkpoint  131/6164
## GO:0022613 GO:0022613 ribonucleoprotein complex biogenesis  229/6164
## GO:0016236 GO:0016236                       macroautophagy  135/6164
##              BgRatio       pvalue     p.adjust       qvalue
## GO:0044770 466/15024 6.070250e-14 1.390694e-10 1.183060e-10
## GO:0051186 353/15024 4.066920e-10 1.694057e-07 1.441130e-07
## GO:0043414 259/15024 2.754510e-09 7.888227e-07 6.710493e-07
## GO:0000075 216/15024 3.844910e-09 9.787431e-07 8.326141e-07
## GO:0022613 420/15024 1.077795e-08 1.975383e-06 1.680453e-06
## GO:0016236 227/15024 1.300817e-08 2.128694e-06 1.810874e-06

The next step is of courtse visualizing the results. Let’s first see the dotplot function:

#plotting adj p-values and gene ratio
dotplot(filteredEgo)

GSEA on KEGG pathways

We will now perform a gene set enrichment analysis on KEGG pathways.

#the input must be a descending named vector of statistics
geneList <- abs(res$t)
names(geneList) <- entrezIds;
geneList <- sort(geneList, decreasing = TRUE)

#enrichment analysis
keggGSEA <- gseKEGG(geneList      = geneList,
                    organism      = 'hsa',
                    nPerm         = 10000,
                    minGSSize     = 15,
                    maxGSSize     = 500,
                    pAdjustMethod = 'fdr',
                    pvalueCutoff  = 0.05,
                    verbose       = FALSE)
head(keggGSEA)[ , 1:8]
##                ID                                 Description setSize
## hsa00190 hsa00190                   Oxidative phosphorylation     120
## hsa00280 hsa00280  Valine, leucine and isoleucine degradation      48
## hsa00520 hsa00520 Amino sugar and nucleotide sugar metabolism      45
## hsa01200 hsa01200                           Carbon metabolism     111
## hsa03013 hsa03013                               RNA transport     144
## hsa03018 hsa03018                             RNA degradation      75
##          enrichmentScore      NES    pvalue    p.adjust     qvalues
## hsa00190       0.6174200 1.423097 9.999e-05 0.000816585 0.000380079
## hsa00280       0.6842336 1.518650 9.999e-05 0.000816585 0.000380079
## hsa00520       0.6787282 1.501501 9.999e-05 0.000816585 0.000380079
## hsa01200       0.6105387 1.403699 9.999e-05 0.000816585 0.000380079
## hsa03013       0.5982056 1.386627 9.999e-05 0.000816585 0.000380079
## hsa03018       0.6416123 1.453659 9.999e-05 0.000816585 0.000380079

Let’s try to have only the upregulated and downregulated pathways:

#upregulated: the input must be the vector of statistics without taking the absolute value
geneList <- res$t
names(geneList) <- entrezIds;
geneList <- sort(geneList, decreasing = TRUE)

#enrichment analysis
keggGSEA_up <- gseKEGG(geneList   = geneList,
                    organism      = 'hsa',
                    nPerm         = 10000,
                    minGSSize     = 15,
                    maxGSSize     = 500,
                    pAdjustMethod = 'fdr',
                    pvalueCutoff  = 0.05,
                    verbose       = FALSE)
head(keggGSEA_up)[ , 1:8]
##                ID                           Description setSize
## hsa04146 hsa04146                            Peroxisome      81
## hsa00600 hsa00600               Sphingolipid metabolism      41
## hsa05219 hsa05219                        Bladder cancer      41
## hsa04340 hsa04340            Hedgehog signaling pathway      45
## hsa05130 hsa05130 Pathogenic Escherichia coli infection      50
## hsa05213 hsa05213                    Endometrial cancer      50
##          enrichmentScore       NES       pvalue     p.adjust      qvalues
## hsa04146      -0.5255274 -1.751375 0.0001758706 0.0007178015 0.0002621402
## hsa00600       0.6274208  1.922880 0.0002213859 0.0007178015 0.0002621402
## hsa05219       0.6479824  1.985896 0.0002213859 0.0007178015 0.0002621402
## hsa04340       0.6668571  2.078976 0.0002226676 0.0007178015 0.0002621402
## hsa05130       0.6795976  2.164055 0.0002233639 0.0007178015 0.0002621402
## hsa05213       0.6435807  2.049366 0.0002233639 0.0007178015 0.0002621402
#downregulated: the input must be -1 * the vector of statistics
geneList <- -1 * res$t
names(geneList) <- entrezIds;
geneList <- sort(geneList, decreasing = TRUE)

#enrichment analysis
keggGSEA_down <- gseKEGG(geneList = geneList,
                    organism      = 'hsa',
                    nPerm         = 10000,
                    minGSSize     = 15,
                    maxGSSize     = 500,
                    pAdjustMethod = 'fdr',
                    pvalueCutoff  = 0.05,
                    verbose       = FALSE)
head(keggGSEA_down)[ , 1:8]
##                ID                          Description setSize
## hsa05016 hsa05016                 Huntington's disease     180
## hsa05010 hsa05010                  Alzheimer's disease     161
## hsa00230 hsa00230                    Purine metabolism     164
## hsa05152 hsa05152                         Tuberculosis     169
## hsa04723 hsa04723 Retrograde endocannabinoid signaling     138
## hsa05012 hsa05012                  Parkinson's disease     126
##          enrichmentScore      NES       pvalue   p.adjust     qvalues
## hsa05016       0.5854221 2.176475 0.0001675884 0.00160704 0.001093225
## hsa05010       0.5971317 2.194295 0.0001690331 0.00160704 0.001093225
## hsa00230       0.5397284 1.987010 0.0001691189 0.00160704 0.001093225
## hsa05152       0.4498151 1.659525 0.0001694341 0.00160704 0.001093225
## hsa04723       0.5526556 1.988318 0.0001719986 0.00160704 0.001093225
## hsa05012       0.6543800 2.332172 0.0001729505 0.00160704 0.001093225

It is interesting to visualize the enrichment of single pathways, for example the most significant one, Oxidative phosphorylation (hsa00190):

gseaplot(keggGSEA, geneSetID = "hsa00190")

Moreover, we can also explore the association among enriched KEGG pathways with the enrichMap function:

#visualization
png(filename = 'enrichMap_keggGSEA.png', width = 3600, height = 4800, res = 300)
enrichMap(keggGSEA)
dev.off()
## png 
##   2

Enrichment analysis the hard way

The clusterProfiler package hides a number of details from the user; each time you use one of its function, R automatically downloads the definition of the required gene sets (KEGG, GO, Reactome). We will now see how to perform some of this actions by ourselves.

First, we will download the information beloning to the Oxidative phosphorylation (hsa00190) pathway from the KEGG database:

#libray
library(KEGGREST)

#conversion from kegg id to entrez
conversion <- keggConv("ncbi-geneid", "hsa")

#retrieving genes for hsa00190
tmp <- keggLink(target = 'genes', source = 'hsa00190')
  
#converting genes
tmp <- conversion[tmp];
hsa00190 <- sub('ncbi-geneid:', '', tmp)

Let’s now use the geneSetTest function from the limma package for performing a comparison between distribuions between the background statistics and the hsa00190 genes. Particularly, geneSetTest will check if, on average, genes belonging to hsa00190 have a higher rank than background genes.

#limma library
library(limma)

#extracting the background genes
geneList <- abs(res$t)
names(geneList) <- entrezIds;

#ensuring that only measured genes are in hsa00190
hsa00190 <- intersect(hsa00190, names(geneList))

#applying the test
gstRes <- geneSetTest(hsa00190, geneList)
gstRes
## [1] 6.367928e-15

As we can see, the test only returns a p-value. All the part about visualizing and organizing the resuls is finally left to the user