The ggplot2 is becoming the de facto standard for the visualization of scientific data in R. The point of strength of this package are the separation between the concepts of “data” and “visualization”, and the posssibility of easily customizing plots at a very fine level. Furthermore, there are several resources online, including a dedicated website, http://ggplot2.org/ , providing help and documentation
library(ggplot2)
Let’s start with some very simple data.
exampleData <- data.frame(values1 = c(3,2,4),
values2 = c(6,4,2),
names = letters[1:3])
exampleData
## values1 values2 names
## 1 3 6 a
## 2 2 4 b
## 3 4 2 c
The first step for producing a ggplot is to declare a “mapping” between he data and components of the plot.
p <- ggplot(data = exampleData,
mapping =aes(x = values1,
y = values2,
label = names))
Notice that we did not indicate yet what type of plot we desire. If we plot the data now, we have an empty canvas:
plot(p)
Let’s specify that we desire the data to be arranged in a scatter plot:
p + geom_point()
Here we see ggplot2 peculiar sintax: the various part of the graph are added one after the other, through the operator + Let’s see some other example; what if we want to represent the data as a line?
p + geom_line()
Or as the area below that line:
p + geom_area()
Finally, we can even print the labels, in the position specified by the x and y coordinates:
p + geom_text()
Question time: how many dimension can we represent in a single scatterplot?
Let’s create some example data. We are going to simulate a matrix containing the expression of numGenes (rows) and numSamples (columns). The outcome vector divides the samples in two groups.
numGenes <- 1000;
numSamples <- 20;
outcome <- c(rep(1, 10), rep(0, 10));
dataset <- matrix(rnorm(numGenes * numSamples),
numGenes, numSamples);
dataset <- dataset + abs(min(dataset)) + 1;
rownames(dataset) <- paste('S', 1:numGenes, sep = '')
colnames(dataset) <- paste('G', 1:numSamples, sep = '')
dataset[1:5, 1:5]
## G1 G2 G3 G4 G5
## S1 5.467192 6.559175 4.273822 5.391555 3.838972
## S2 5.591129 4.961137 5.957886 4.166299 4.697715
## S3 4.772360 4.025235 4.305237 4.476141 5.060930
## S4 4.428166 4.102886 5.980289 4.869142 4.213016
## S5 5.487550 4.500727 6.289005 5.982544 3.877276
Now, we apply the t.test on each row of the dataset:
pvalues <- apply(dataset, 1,
function(x){t.test(x[outcome == 1], x[outcome == 0])$p.value})
logPvalues <- -1 * log10(pvalues)
Similarly, we compute the log fold changes:
logFoldChanges <- apply(dataset, 1,
function(x){log2(mean(x[outcome == 1]) / mean(x[outcome == 0]))})
Here, we just finished our first differential expression analysis! Easy, isn’t it? Let’s go further and try to identify the statistically significant genes:
significant <- logPvalues >= 2;
We also assume that the genes come from two distinct and separate pathways:
pathway <- c(rep('Wnt signaling', numGenes/2), rep('MAPK signaling', numGenes/2));
And finally, we assign a transcript length to each gene:
transcriptLength <- 10 * runif(numGenes);
In order to visualize all this information in ggplot2, we must organize it as a data frame
geneNames <- rownames(dataset)
toPlot <- data.frame(geneNames,
logFoldChanges,
logPvalues,
significant,
pathway,
transcriptLength)
head(toPlot)
## geneNames logFoldChanges logPvalues significant pathway
## S1 S1 -0.18134713 0.9399505 FALSE Wnt signaling
## S2 S2 0.05799777 0.1649256 FALSE Wnt signaling
## S3 S3 0.22488214 1.2912748 FALSE Wnt signaling
## S4 S4 -0.11899440 0.4364565 FALSE Wnt signaling
## S5 S5 0.20590590 0.7861058 FALSE Wnt signaling
## S6 S6 -0.17574736 0.5296408 FALSE Wnt signaling
## transcriptLength
## S1 6.8970436
## S2 7.6000494
## S3 6.8262605
## S4 5.5869166
## S5 6.2309381
## S6 0.3890138
First, the most basic information: x and y
ggplot() +
geom_point(data = toPlot,
mapping = aes(x = logFoldChanges,
y = logPvalues))
Now, we add the color, according to the significance:
ggplot() +
geom_point(data = toPlot,
mapping = aes(x = logFoldChanges,
y = logPvalues,
color = significant))
Shape according to pathway:
ggplot() +
geom_point(data = toPlot,
mapping = aes(x = logFoldChanges,
y = logPvalues,
color = significant,
shape = pathway))
Size, according to transcript lenght:
ggplot() +
geom_point(data = toPlot,
mapping = aes(x = logFoldChanges,
y = logPvalues,
color = significant,
shape = pathway,
size = transcriptLength))
Adding an alpha channel (transparency). Notice that this parameter is outside the mapping. This means that all points will be equally transparent.
ggplot() +
geom_point(data = toPlot,
mapping = aes(x = logFoldChanges,
y = logPvalues,
color = significant,
shape = pathway,
size = transcriptLength),
alpha = 0.7)
Finally, let’s customize our volcano plot: title, axis, legend, labels, theme:
p <- ggplot() +
geom_point(data = toPlot,
mapping = aes(x = logFoldChanges,
y = logPvalues,
color = significant,
shape = pathway,
size = transcriptLength),
alpha = 0.7) +
ggtitle('Volcano plot') +
xlab('Log2 Fold Changes') +
ylab('Log10 p-values') +
scale_color_manual(values = c('red', 'blue')) +
scale_size_continuous(name = 'transcript length') +
theme_bw()
plot(p)
We are almost done! Now we only want to write down the names of the most relevant genes:
relevantGenes <- ifelse(logPvalues >= 2.5, geneNames, '')
toPlot$relevantGenes <- relevantGenes;
q <- p + geom_text(data = toPlot,
mapping = aes(x = logFoldChanges,
y = logPvalues,
label = relevantGenes))
plot(q)
Something did not go as expected. The names of the genes are directly over the plot. Unfortunately, ggplot2 has no mechanisms for avoiding this type of problems.
Luckily enough, it is possible to write extension packages for ggplot2. One of this package, namely ggrepel, might be useful in this situation:
library(ggrepel)
q <- p + geom_text_repel(data = toPlot,
mapping = aes(x = logFoldChanges,
y = logPvalues,
label = relevantGenes))
plot(q)
The last example implicitly introduced the concept of “multiple layers”. Since ggplot2 graphical mecahnisms is based on adding one element after the other one, it is only natural to add multiple layers to the same plot. In order to describe this mechanisms, we are going to use an example of signal attenuation in metabolomics data.
First, let’s generate some random data
#signal attenuation in LC/MS
metaboliteValues <- 5 * rnorm(80) + 40;
qcValues <- rnorm(17) + 40;
#data frames
toPlot <- data.frame(timePoint = 1:80, value = metaboliteValues);
toPlotQC <- data.frame(timePoint = seq(1, 81, 5), value = qcValues);
Now, we can easily plot the two data frames separately:
p <- ggplot() + theme_bw() +
geom_point(data = toPlot,
aes(x = timePoint,
y = value),
color = 'gold',
size = 2,
shape = 17) +
ylim(0, 60)
plot(p)
And the QC data:
p <- ggplot() + theme_bw() +
geom_point(data = toPlotQC,
aes(x = timePoint,
y = value),
color = 'blue',
size = 2,
shape = 16) +
ylim(0, 60)
plot(p)
However, toPlot and toPlotQC have different number of elements, so these two data frames cannot be combined together. In this case, we must plot their data on two different layers:
p <- ggplot() + theme_bw() + ylim(0, 60) +
#first layer
geom_point(data = toPlot,
aes(x = timePoint,
y = value),
color = 'gold',
size = 2,
shape = 17) +
#second layer
geom_point(data = toPlotQC,
aes(x = timePoint,
y = value),
color = 'blue',
size = 2,
shape = 16) +
#third layer
geom_smooth(data = toPlotQC,
aes(x = timePoint,
y = value))
plot(p)
## `geom_smooth()` using method = 'loess'
Heatmaps are great tools for visualizing tabular data. A heatmap is a two dimensional grid where each element of the grid assumes a color that is proportional to the value reported in the data table. Optionally, rows and columns of the heatmap are re-ordered according to a hierarchical clustering of the table elements.
Unfortunately, visualization of heatmaps in ggplot is not straightforward. Thus, a better solution is using functions provided by different packages. We will see three of them
This package provide the function aheatmap which allow to plot a static heatmap. The plot comes by default with dendrograms for both columns and rows. Furthermore, the heatmap can be enriched with colored ribbons indicating how samples are distributed according to one or more grouping variables. Here an example:
library(NMF);
## Loading required package: pkgmaker
## Loading required package: registry
##
## Attaching package: 'pkgmaker'
## The following object is masked from 'package:base':
##
## isNamespaceLoaded
## Loading required package: rngtools
## Loading required package: cluster
## NMF - BioConductor layer [OK] | Shared memory capabilities [NO: windows] | Cores 3/4
aheatmap(dataset, annCol = data.frame(outcome = as.factor(outcome)))
The look-and-feel of the plot can be highly customized, as shown in the help page of the function:
?aheatmap
The d3heatmap function is similar to the previous one, however it provides some additional capabilities for the user to lively interacting with the heatmap.
library(d3heatmap)
d3heatmap(dataset[1:50, ], show_grid = FALSE)
Also in this case, the help page of the function provides a large number of options:
?d3heatmap
An important factor in how the heatmap looks, and consequently what information will convey, is given by any normalization of the data. This issue will be investigated later in the course on real-world data.
The corrplot function from the homonym package was originally devised for correlation plots. However, it can be easily customized for producing appealing heatmaps. Here we see an example of correlation plot on the well know mtcars dataset.
library(corrplot)
data(mtcars)
M <- cor(mtcars)
corrplot(M, method = 'color', order = 'hclust')