The flowCore package

The main package for loading, storing and pre-processing flow and mass-cytometry data in R is flowCore. This package is compatible with the most common industrial standars for cytometry files, namely FCS2 and FCS3. It is used by a number of other packages the offer more advanced functionalities for cytometry data, as for example flowStats and ggcyto. Part of this lesson is taken from the flowCore vignette: http://bioconductor.org/packages/release/bioc/vignettes/flowCore/inst/doc/HowTo-flowCore.pdf

Let us start by installing the flowCore package:

if(!require('flowCore', quietly = TRUE)){
  source("https://bioconductor.org/biocLite.R")
  biocLite("flowCore")  
  library('flowCore', quietly = TRUE)
}
## Warning: package 'flowCore' was built under R version 3.4.0

This package provides a set of example files that we will use during the lesson. To locate the files, we use the system.file command:

dir(system.file('extdata', package = 'flowCore'))
## [1] "0877408774.B08" "0877408774.E07" "0877408774.F06" "compdata"

The **extdata* folder contains three FCS files and one folder with files from a flow cytometry experiments.

Reading FCS single files and set of FCS files

The first operation is reading a single file:

fileName <- system.file("extdata","0877408774.B08", package="flowCore")
x <- read.FCS(fileName, alter.names = TRUE)
## Warning in version %in% c("FCS2.0", "FCS3.0", "FCS3.1"): bytecode version
## mismatch; using eval
summary(x)
##         FSC.H  SSC.H    FL1.H   FL2.H    FL3.H   FL1.A    FL4.H  Time
## Min.       85   11.0    1.000    1.00    1.000    0.00    1.000   1.0
## 1st Qu.   385  141.0    8.131   12.08    2.247    0.00    6.612 122.0
## Median    441  189.0  135.200   22.47    5.674   26.00   12.300 288.0
## Mean      492  277.9  157.800  106.00    8.465   34.08  140.400 294.8
## 3rd Qu.   518  270.0  241.400   50.94   10.750   51.00   33.380 457.5
## Max.     1023 1023.0 3652.000 9910.00 3278.000 1023.00 9822.000 626.0

This file contains seven channels, out of which we are mainly interested in the fluorescent ones (-H). The alter.names = TRUE ensure that the names of the channels do not contain any character that is not compatible with data.frame column names in R (for example the dash character ‘-’).

Several options are available when reading an FCS files. For example, only some channels can be selected:

y <- read.FCS(fileName, alter.names = TRUE, column.pattern = '.H')
summary(y)
##         FSC.H  SSC.H    FL1.H   FL2.H    FL3.H    FL4.H
## Min.       85   11.0    1.000    1.00    1.000    1.000
## 1st Qu.   385  141.0    8.131   12.08    2.247    6.612
## Median    441  189.0  135.200   22.47    5.674   12.300
## Mean      492  277.9  157.800  106.00    8.465  140.400
## 3rd Qu.   518  270.0  241.400   50.94   10.750   33.380
## Max.     1023 1023.0 3652.000 9910.00 3278.000 9822.000

other than a subset of rows:

y <- read.FCS(fileName, alter.names = TRUE, which.lines = 101:110)
summary(y)
##         FSC.H SSC.H   FL1.H   FL2.H  FL3.H  FL1.A  FL4.H Time
## Min.    193.0 140.0   4.491   5.233  1.000   0.00  7.041  6.0
## 1st Qu. 381.2 155.5   6.092   8.155  2.419   0.00  9.395  6.0
## Median  418.0 209.5   8.109  21.480  5.772   0.00 12.410  6.5
## Mean    445.2 219.9 107.000  34.370  5.563  22.60 30.480  6.5
## 3rd Qu. 488.8 269.5 199.300  45.690  8.084  40.75 45.610  7.0
## Max.    737.0 323.0 425.500 139.500 11.440 102.00 91.400  7.0

More interestingly, the data can be scaled within the interval [0; 1]:

y <- read.FCS(fileName, alter.names = TRUE, transformation="scale")
summary(y)
##           FSC.H   SSC.H     FL1.H    FL2.H     FL3.H   FL1.A     FL4.H
## Min.    0.08309 0.01075 0.0000000 0.000000 0.0000000 0.00000 0.0000000
## 1st Qu. 0.37630 0.13780 0.0007132 0.001108 0.0001247 0.00000 0.0005612
## Median  0.43110 0.18480 0.0134200 0.002147 0.0004675 0.02542 0.0011300
## Mean    0.48090 0.27170 0.0156800 0.010500 0.0007466 0.03331 0.0139400
## 3rd Qu. 0.50640 0.26390 0.0240500 0.004994 0.0009747 0.04985 0.0032380
## Max.    1.00000 1.00000 0.3651000 0.991000 0.3277000 1.00000 0.9822000
##              Time
## Min.    0.0009775
## 1st Qu. 0.1193000
## Median  0.2815000
## Mean    0.2881000
## 3rd Qu. 0.4472000
## Max.    0.6119000

The x variable is a object of class flowFrame. This class was purposedly designed in order to resemble the expressionSet class. This means that the actual data are stored within x as a matrix that can be accessed with the method exprs(), while the parameters() method allows to access meta information on the single columns.

head(exprs(x))
##      FSC.H SSC.H      FL1.H     FL2.H     FL3.H FL1.A       FL4.H Time
## [1,]   382    77 259.455272   1.00000  7.566695    55   13.097473    1
## [2,]   628   280   9.057978  48.26071 10.273508     0   28.133175    1
## [3,]  1023   735 537.611747  56.23413  6.915821   143  310.590022    1
## [4,]   373   128   6.152654  24.14418  2.329097     0    3.819718    1
## [5,]  1023  1023 259.455272 791.47554 39.241898    61 2414.418221    1
## [6,]   489   292   5.002865  28.90264  3.995421     0   26.179946    1
parameters(x)@data
##      name              desc range minRange maxRange
## $P1 FSC.H             FSC-H  1024        0     1023
## $P2 SSC.H             SSC-H  1024        0     1023
## $P3 FL1.H              <NA>  1024        1    10000
## $P4 FL2.H              <NA>  1024        1    10000
## $P5 FL3.H              <NA>  1024        1    10000
## $P6 FL1.A              <NA>  1024        0     1023
## $P7 FL4.H              <NA>  1024        1    10000
## $P8  Time Time (51.20 sec.)  1024        0     1023

Additionally, it is possible to visualize information regarding the experimental and measurement settings by inspecting the keywords of the FCS files:

head(keyword(x))
## $FCSversion
## [1] "2"
## 
## $`$BYTEORD`
## [1] "4,3,2,1"
## 
## $`$DATATYPE`
## [1] "F"
## 
## $`$NEXTDATA`
## [1] "0"
## 
## $`$SYS`
## [1] "Macintosh System Software 9.2.2"
## 
## $CREATOR
## [1] "CELLQuestª 3.3"
keyword(x)[['SAMPLE ID']]
## [1] "Default Patient ID"

Visualizing flowFrame data

The easiest way to visualize the information in a flowFrame is by using the package ggcyto.

if(!require('ggcyto', quietly = TRUE)){
  source("https://bioconductor.org/biocLite.R")
  biocLite('ggcyto')  
  library('ggcyto', quietly = TRUE)
}
## Warning: package 'ggcyto' was built under R version 3.4.0
## Warning: package 'ncdfFlow' was built under R version 3.4.0
## Warning: package 'flowWorkspace' was built under R version 3.4.0

Scatter and density plots are the most used visualization tools in cytometry. The corresponding figures can also be used for identifying The autoplot function will automatically provide either a scatter or a density plot depending whether two or one channel are indicated as input:

#scatter plot
autoplot(x, "FL1.H", "FL2.H")
## Warning: Removed 27 rows containing missing values (geom_hex).

#density plot
autoplot(x, "FL1.H")

Both plots show some quite long ties for both channels; we can transform the data in order to better visualized them:

y <- transform(x, `log.FL1.H`=log(`FL1.H`), `log.FL2.H`=log(`FL2.H`))

autoplot(y, "log.FL1.H", "log.FL2.H")
## Warning: Removed 32 rows containing missing values (geom_hex).

autoplot(y, "log.FL1.H")

Analyzing several FCS files at ones.

In a usual experiment several FCS files are produced, each corresponding to a specific experimental conditions. To keep trace of these files as a whole the flowSet class was introduced.

#reading several files in a flowSet object
fs <- read.flowSet(path=system.file("extdata", "compdata", "data", package="flowCore"), 
                   alter.names =  TRUE, phenoData=list(name="SAMPLE ID", Filename="$FIL"))

#showing fs
fs
## A flowSet with 5 experiments.
## 
## An object of class 'AnnotatedDataFrame'
##   rowNames: 060909.001 060909.002 ... 060909.005 (5 total)
##   varLabels: name Filename
##   varMetadata: labelDescription
## 
##   column names:
##   FSC.H SSC.H FL1.H FL2.H FL3.H FL1.A FL4.H
#phenotype data, as described in the read.flowSet function
pData(fs)
##            name   Filename
## 060909.001   NA 060909.001
## 060909.002 fitc 060909.002
## 060909.003   pe 060909.003
## 060909.004  apc 060909.004
## 060909.005 7AAD 060909.005

The pheno data can also contain information regarding dosages or other experimental settings:

pData(fs)$dosage <- c(1, 2, 1, 2, 1)
pData(fs)$temperature <- c(30, 30, 40, 40, 50);
pData(fs)
##            name   Filename dosage temperature
## 060909.001   NA 060909.001      1          30
## 060909.002 fitc 060909.002      2          30
## 060909.003   pe 060909.003      1          40
## 060909.004  apc 060909.004      2          40
## 060909.005 7AAD 060909.005      1          50

Single flowFrames within a flowSet can be accessed through the standard subsetting operators; alternatively, the fsApply method acts quite as the normal apply function.

#extracting  single flowFrame and computing the median of the column
x <-fs[[1]];
apply(exprs(x), 2, median)
##      FSC.H      SSC.H      FL1.H      FL2.H      FL3.H      FL1.A 
## 423.000000 128.000000   4.104698   4.531584   3.651741   0.000000 
##      FL4.H 
##   7.233942
#the same with fsApply
fsApply(fs, function(x) apply(x, 2, median), use.exprs=TRUE)
##            FSC.H SSC.H      FL1.H      FL2.H      FL3.H FL1.A      FL4.H
## 060909.001   423   128   4.104698   4.531584   3.651741     0   7.233942
## 060909.002   436   128 930.572041 228.757320  33.376247   217   8.278826
## 060909.003   438   120  10.181517 791.475544 114.444190     0   9.305720
## 060909.004   441   129   4.371445   4.869675   4.782858     0 358.663762
## 060909.005   429   133   5.002865  14.989296  63.209339     0  20.908000
#... simplified version!
fsApply(fs, each_col, median)
##            FSC.H SSC.H      FL1.H      FL2.H      FL3.H FL1.A      FL4.H
## 060909.001   423   128   4.104698   4.531584   3.651741     0   7.233942
## 060909.002   436   128 930.572041 228.757320  33.376247   217   8.278826
## 060909.003   438   120  10.181517 791.475544 114.444190     0   9.305720
## 060909.004   441   129   4.371445   4.869675   4.782858     0 358.663762
## 060909.005   429   133   5.002865  14.989296  63.209339     0  20.908000

Data transformation

We saw how transformations can greatly improve the understandability of the data. Transformations can be applied by using the tranformationList method. Firt, we must create a transformation functions:

transFunct <- arcsinhTransform(transformationId = 'arcsinh transformation with default parameters')
transFunct@.Data
## function (x) 
## asinh(a + b * x) + c
## <environment: 0x000000001f357310>

We see that transFunction internally contains an arcsinh function that smooths extremes values, much alike the logarithm function. Other transformations can be defined similarly; a number of them are already in flowCore, as for example the lnTransform for natural logarith transformations.

Let us apply the transFunct to all fluorescent channels:

# transforming
myTrans <- transformList(from = c('FL1.H', 'FL2.H', 'FL3.H', 'FL4.H'), transFunct)
fsNew <- transform(fs, myTrans)

#plotting the results
autoplot(fs, "FL1.H", "FL2.H") + ggtitle('Original values')
## Warning: Removed 26 rows containing missing values (geom_hex).

autoplot(fsNew, "FL1.H", "FL2.H") + ggtitle('Transformed values')
## Warning: Removed 16 rows containing missing values (geom_hex).

Gating and subsetting

The gating procedure is essential in order to select the subpopulation of cells one is interested in. R does not provide an interactive environment for gating, so it must be done programmatically.

#simplest gating region
rectGate <- rectangleGate(filterId="Region of interest", "FL1.H"=c(0, 5), "FL2.H"=c(0, 5))

#subsetting 
smallerFs <- Subset(fsNew, rectGate)

#comparing
fsNew[[2]]
## flowFrame object '060909.002'
## with 8805 cells and 7 observables:
##      name            desc range minRange    maxRange
## $P1 FSC.H Forward Scatter  1024 0.000000 1023.000000
## $P2 SSC.H    Side Scatter  1024 0.000000 1023.000000
## $P3 FL1.H            FITC  1024 1.443635    9.903588
## $P4 FL2.H            <NA>  1024 1.443635    9.903588
## $P5 FL3.H            <NA>  1024 1.443635    9.903588
## $P6 FL1.A            <NA>  1024 0.000000 1023.000000
## $P7 FL4.H            <NA>  1024 1.443635    9.903588
## 146 keywords are stored in the 'description' slot
smallerFs[[2]]
## flowFrame object '060909.002'
## with 40 cells and 7 observables:
##      name            desc range minRange    maxRange
## $P1 FSC.H Forward Scatter  1024 0.000000 1023.000000
## $P2 SSC.H    Side Scatter  1024 0.000000 1023.000000
## $P3 FL1.H            FITC  1024 1.443635    9.903588
## $P4 FL2.H            <NA>  1024 1.443635    9.903588
## $P5 FL3.H            <NA>  1024 1.443635    9.903588
## $P6 FL1.A            <NA>  1024 0.000000 1023.000000
## $P7 FL4.H            <NA>  1024 1.443635    9.903588
## 146 keywords are stored in the 'description' slot
#plotting
autoplot(fsNew, "FL1.H", "FL2.H") + ggtitle('Original values')
## Warning: Removed 16 rows containing missing values (geom_hex).

autoplot(smallerFs, "FL1.H", "FL2.H") + ggtitle('Transformed values')
## Warning: Removed 26 rows containing missing values (geom_hex).

Comparing distributions

One more important application is comparing the distribution of the same marker across different samples. Let’s visualize FL1.H across all FCS files:

autoplot(fsNew, "FL1.H")

It seems reasonable to think that 7AAD and apc have comparable values of FL1.H, while fitc is quite different, probably due to some difference in the experimental conditions. In order to assess if this is the case, we can apply a t-test to each comparison:

#checking the samples names
pData(fsNew)
##            name   Filename dosage temperature
## 060909.001   NA 060909.001      1          30
## 060909.002 fitc 060909.002      2          30
## 060909.003   pe 060909.003      1          40
## 060909.004  apc 060909.004      2          40
## 060909.005 7AAD 060909.005      1          50
#comparing 7AAD and apc
matrix7AAD <- exprs(fsNew[['060909.005']])
matrix_apc <- exprs(fsNew[['060909.004']])
t.test(matrix7AAD[ , 'FL1.H'], matrix_apc[ , 'FL1.H'])
## 
##  Welch Two Sample t-test
## 
## data:  matrix7AAD[, "FL1.H"] and matrix_apc[, "FL1.H"]
## t = 15.787, df = 19995, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.07731393 0.09923381
## sample estimates:
## mean of x mean of y 
##  2.508825  2.420551
#comparing 7AAD and fitc
matrix_fitc <- exprs(fsNew[['060909.002']])
t.test(matrix7AAD[ , 'FL1.H'], matrix_fitc[ , 'FL1.H'])
## 
##  Welch Two Sample t-test
## 
## data:  matrix7AAD[, "FL1.H"] and matrix_fitc[, "FL1.H"]
## t = -717.84, df = 16110, p-value < 2.2e-16
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -5.003289 -4.976039
## sample estimates:
## mean of x mean of y 
##  2.508825  7.498489

The results of the first test seems counterintuitive; a very significant p-value is assigned to a very samll difference. This problem is due to the extremely large sample sizes involved in these comparisons. In these cases not only the p-values, but also the actual difference must be taken into account for judging whether there is an actual difference between the dataset.