The Gene Expression Omnibus database is the largest collection of microarray data in the world. https://www.ncbi.nlm.nih.gov/geo/
The importance of data sharing in biological and biomedical research cannot be stressed enough: https://www.youtube.com/watch?v=5G2dgOyPYGY
The package GEOquery provides a confortable interface for downloading and interrogating the GEO dataset. First, let’s start by installing the package
#installing GEOQuery
if(!require('GEOquery', quietly = TRUE)){
#note: use 'http' if 'https' is not supported
source(file = "http://bioconductor.org/biocLite.R")
biocLite("GEOquery")
library("GEOquery")
}
The main function used by GEOquery is getGEO, that allows to download samples, series and datasets. Let us start with a single sample:
#my first sample!
mySample <- getGEO('GSM1436305');
Let’s investigate the sample more in detail:
#what class is the sample?
class(mySample)
## [1] "GSM"
## attr(,"package")
## [1] "GEOquery"
#how to access the documentation for a class
#?`GSM-class`
#?`GEOData-class`
The help functions told us the methods that can be called on GSM class. Let us see now what they do in detail:
#sample meta data
Meta(mySample)
## $channel_count
## [1] "1"
##
## $characteristics_ch1
## [1] "weight: normal weight"
##
## $contact_address
## [1] "5 University Street"
##
## $contact_city
## [1] "London"
##
## $contact_country
## [1] "United Kingdom"
##
## $contact_department
## [1] "Centre for Cardiovascular Genetics"
##
## $contact_institute
## [1] "UCL Institute of Cardiovascular Science"
##
## $contact_name
## [1] "Anastasia,Z.,Kalea"
##
## $`contact_zip/postal_code`
## [1] "WC1E 6JF"
##
## $data_processing
## [1] "Data analysis was done using Partek 6.6 software"
##
## $data_row_count
## [1] "25087"
##
## $description
## [1] "miRNA expression in gingival tissue biopsies"
##
## $extract_protocol_ch1
## [1] "mirVanaâ<U+0084>¢ miRNA Isolation Kit (Ambion®, TX, USA) following the manufacturerâ<U+0080><U+0099>s instructions"
##
## $geo_accession
## [1] "GSM1436305"
##
## $hyb_protocol
## [1] "The Affymetrix Fluidics protocol (FS450_0003) was followed for hybridization, washing, and scanning of the slides using the GeneChip Scanner 3000."
##
## $label_ch1
## [1] "biotin"
##
## $label_protocol_ch1
## [1] "FlashTag Biotin RNA Labeling Kit for Affymetrix GeneChip miRNA arrays (Genisphere, PA, USA) was used to label 500ng of total RNA by the addition of polyA-polymerase, following the manufacturerâ<U+0080><U+0099>s instructions."
##
## $last_update_date
## [1] "May 31 2017"
##
## $molecule_ch1
## [1] "total RNA"
##
## $organism_ch1
## [1] "Homo sapiens"
##
## $platform_id
## [1] "GPL16384"
##
## $scan_protocol
## [1] "Affymetrix miRNA QC tools were used for data normalization and background correction. This was done using RMA (Robust Multi-array Average) and DABG (Detection Above Background) methods to down-weigh the poorly performing probes and to detect the P-value for each probe set"
##
## $series_id
## [1] "GSE59398"
##
## $source_name_ch1
## [1] "human gingival tissue, normal weight"
##
## $status
## [1] "Public on May 31 2017"
##
## $submission_date
## [1] "Jul 14 2014"
##
## $supplementary_file
## [1] "ftp://ftp.ncbi.nlm.nih.gov/geo/samples/GSM1436nnn/GSM1436305/suppl/GSM1436305_KALEA_1.CEL.gz"
##
## $taxid_ch1
## [1] "9606"
##
## $title
## [1] "gingival tissue, normal weight 1"
##
## $type
## [1] "RNA"
#sample header
Columns(mySample)
## Column Description
## 1 ID_REF
## 2 VALUE normalized signal intensity
#data table
head(Table(mySample))
## ID_REF VALUE
## 1 14q0_st 1.73327
## 2 14qI-1_st 3.26313
## 3 14qI-1_x_st 1.97043
## 4 14qI-2_st 1.22653
## 5 14qI-3_x_st 1.24185
## 6 14qI-4_st 2.58168
This is the information in a single sample. We will now see how it works for a series:
#getting a series
mySeries <- getGEO('GSE9577', GSEMatrix = FALSE)
## Parsing....
#help!
#?`GSE-class`
A series, as the name says, is a collection of objects. Let’s have a look to its metadata:
#getting a series
Meta(mySeries)
## $contact_address
## [1] "One Brookings Drive; Campus Box 1137"
##
## $contact_city
## [1] "Saint Louis"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Biology"
##
## $contact_email
## [1] "ewelsh@biology.wustl.edu"
##
## $contact_fax
## [1] "314-935-6803"
##
## $contact_institute
## [1] "Washington University in Saint Louis"
##
## $contact_laboratory
## [1] "Pakrasi"
##
## $contact_name
## [1] "Eric,Agnew,Welsh"
##
## $contact_state
## [1] "MO"
##
## $`contact_zip/postal_code`
## [1] "63130"
##
## $email
## [1] "geo@ncbi.nlm.nih.gov"
##
## $geo_accession
## [1] "GSE9577"
##
## $institute
## [1] "NCBI NLM NIH"
##
## $last_update_date
## [1] "Mar 17 2012"
##
## $name
## [1] "Gene Expression Omnibus (GEO)"
##
## $overall_design
## [1] "Comparison of wild type (HT3) to 3 mutants (deltaPsbV HT3, delataPsbP HT3, and deltaPsbQ HT3)"
##
## $platform_id
## [1] "GPL6115"
##
## $platform_taxid
## [1] "1148"
##
## $pubmed_id
## [1] "18693241" "20858728"
##
## $relation
## [1] "BioProject: https://www.ncbi.nlm.nih.gov/bioproject/PRJNA103405"
##
## $sample_id
## [1] "GSM242040" "GSM242041" "GSM242042"
##
## $sample_taxid
## [1] "1148"
##
## $status
## [1] "Public on Aug 18 2008"
##
## $submission_date
## [1] "Nov 09 2007"
##
## $summary
## [1] "In this study, we undertook a global proteomics analysis of isolated PSII complexes, comparing protein profiles of HT3 (C-terminal hexahistidine tagged CP47 protein) to those of deltaPsbV HT3, deltaPsbP HT3, and deltaPsbQ HT3. The sensitivity of these techniques allowed for identification of not only stoichiometric components of active PSII complexes, but also for the identification of proteins that are transiently associated with PSII throughout its lifecycle, such as assembly, repair, or degradation partners. From the results, we identified an operon of unknown function which contains binding domains for photosynthetic cofactors and which we have named the cofactor integration operon (cio). Upon deletion of the operon, photosynthetic capacity is decreased, indicating a function in PSII-mediated activity."
## [2] "Keywords: proteomic, photosystem II, cyanobacteria"
##
## $supplementary_file
## [1] "ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/series/GSE9577/GSE9577_Peptide_Identifications_From_MSMS.txt"
## [2] "ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/series/GSE9577/GSE9577_Raw_Abundance_Data.txt"
## [3] "ftp://ftp.ncbi.nlm.nih.gov/pub/geo/DATA/supplementary/series/GSE9577/GSE9577_Spectra.zip"
##
## $table_begin
## [1] "Mutant_Ratios_to_Wild_Type"
##
## $title
## [1] "Global proteomic characterization of photosystem II complexes from Synechocystis 6803"
##
## $type
## [1] "Other"
##
## $web_link
## [1] "http://www.ncbi.nlm.nih.gov/geo"
In this case we have only one platform, however we can have more than one. The platform can be extracted with the GPLList method.
#platforms in the series
myPlatforms <- GPLList(mySeries)
#how many platforms?
length(myPlatforms)
## [1] 1
#info about the platform
Meta(myPlatforms[[1]])
## $contact_address
## [1] "One Brookings Drive; Campus Box 1137"
##
## $contact_city
## [1] "Saint Louis"
##
## $contact_country
## [1] "USA"
##
## $contact_department
## [1] "Biology"
##
## $contact_email
## [1] "ewelsh@biology.wustl.edu"
##
## $contact_fax
## [1] "314-935-6803"
##
## $contact_institute
## [1] "Washington University in Saint Louis"
##
## $contact_laboratory
## [1] "Pakrasi"
##
## $contact_name
## [1] "Eric,Agnew,Welsh"
##
## $contact_state
## [1] "MO"
##
## $`contact_zip/postal_code`
## [1] "63130"
##
## $data_row_count
## [1] "218"
##
## $description
## [1] "Proteins identified in photosystem II complexes from Synechocystis sp. PCC 6803"
##
## $distribution
## [1] "virtual"
##
## $geo_accession
## [1] "GPL6115"
##
## $last_update_date
## [1] "Aug 18 2008"
##
## $organism
## [1] "Synechocystis sp. PCC 6803"
##
## $status
## [1] "Public on Aug 18 2008"
##
## $submission_date
## [1] "Nov 09 2007"
##
## $taxid
## [1] "1148"
##
## $technology
## [1] "MS"
##
## $title
## [1] "Global proteomic characterization of photosystem II complexes from Synechocystis sp. PCC 6803"
This is a proteomics series with
Finally, let’s extract the list of samples:
#sample in the series
mySamples <- GSMList(mySeries)
#how many samples?
length(mySamples)
## [1] 3
#data table for one sample
head(Table(mySamples[[1]]))
## ID_REF number of unique peptides Mutant / WT ratio VALUE SE Log2 ratio
## 1 sll0258 8 0.82 -0.28 0.13
## 2 sll0427 10 1.12 0.17 0.12
## 3 sll0851 11 1.00 0.00 0.12
## 4 sll1194 7 0.92 -0.13 0.20
## 5 sll1398 1 1.76 0.81 0.37
## 6 sll1414 2 1.81 0.85 0.16
The problem with the series is that they are provided ‘as is’, meaning that they are not curated by GEO. Particularly, there is no information about phenotype and no ensurance that data across samples are comparable.
In contrast, datasets are re-preprocessed and annotated by GEO. Let us see what this means:
#dowloading a GEO dataset
myDataset <- getGEO('GDS6248')
Meta(myDataset)$description[1]
## [1] "Analysis of livers of C57BL/6J mice fed a high fat diet for up to 24 weeks. Significant body weight gain was observed after 4 weeks. Results provide insight into the effect of high fat diets on metabolism in the liver."
#extracting an ExpressionSet
myDataset <- GDS2eSet(myDataset)
#what object is this?
class(myDataset)
## [1] "ExpressionSet"
## attr(,"package")
## [1] "Biobase"
We can now extract both the RMA-preprocessed expression data and the phenotype information with the ExpressionSet methods:
#expression data
myDataMatrix <- exprs(myDataset);
myDataMatrix[1:5, 1:5]
## GSM994787 GSM994788 GSM994789 GSM994790 GSM994791
## ILMN_1243094 7.01370 7.45044 7.60080 7.25016 7.18672
## ILMN_1238674 6.82441 6.75719 6.74615 6.72456 6.65610
## ILMN_2454720 6.70336 6.71990 6.87642 6.94457 6.84473
## ILMN_3062534 7.46396 7.48809 7.42315 7.59221 7.55262
## ILMN_3140158 9.00972 8.92265 8.88785 9.22240 9.06624
#phenotype data
pheno <- pData(myDataset)
head(pheno)
## sample protocol time
## GSM994787 GSM994787 baseline 0 wk
## GSM994788 GSM994788 baseline 0 wk
## GSM994789 GSM994789 baseline 0 wk
## GSM994790 GSM994790 normal diet 2 wk
## GSM994791 GSM994791 normal diet 2 wk
## GSM994792 GSM994792 normal diet 2 wk
## description
## GSM994787 Value for GSM994787: Baseline (0 week) replicate 1; src: Baseline (0 week) replicate 1
## GSM994788 Value for GSM994788: Baseline (0 week) replicate 2; src: Baseline (0 week) replicate 2
## GSM994789 Value for GSM994789: Baseline (0 week) replicate 3; src: Baseline (0 week) replicate 3
## GSM994790 Value for GSM994790: Mice fed Normal diet for 2 weeks replicate 1; src: Mice fed Normal diet for 2 weeks replicate 1
## GSM994791 Value for GSM994791: Mice fed Normal diet for 2 weeks replicate 2; src: Mice fed Normal diet for 2 weeks replicate 2
## GSM994792 Value for GSM994792: Mice fed Normal diet for 2 weeks replicate 3; src: Mice fed Normal diet for 2 weeks replicate 3
The data are now ready to be analyzed.
The Sequence Read Archive contains sequence reads from NGS data:
https://trace.ncbi.nlm.nih.gov/Traces/sra/sra.cgi?
While a valuable resource by itself, it is not easily accessible from R. Thus, we will focus on the recount database:
https://jhubiostatistics.shinyapps.io/recount/
This repository contains preprocessed RNA-seq data from > 2000 studies. The package for accessing it is (surprise) recount
#installing recount
if(!require('recount', quietly = TRUE)){
#note: use 'http' if 'https' is not supported
source(file = "http://bioconductor.org/biocLite.R")
biocLite("recount")
library("recount")
}
We will now try to download data from the SRP064561 project.
#project of interest
project_id <- 'SRP064561';
# Download the gene-level RangedSummarizedExperiment data
download_study(project_id)
# Load the object rse_gene
load(file.path(project_id, 'rse_gene.Rdata'))
class(rse_gene)
## [1] "RangedSummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
#?`RangedSummarizedExperiment-class`
There are three types of information in a RangedSummarizedExperiment: rowData (gene information), colData (sample information) and assay (the count matrix)
#extracting the count data
count_data <- assay(rse_gene)
count_data[1:5, 1:5]
## SRR2569777 SRR2569750 SRR2569776 SRR2569775 SRR2569774
## ENSG00000000003.14 0 50 0 0 0
## ENSG00000000005.5 0 0 0 0 0
## ENSG00000000419.12 108954 101133 109995 108389 128322
## ENSG00000000457.13 25409 16162 25034 24052 32184
## ENSG00000000460.16 31272 17915 32141 28519 38266
# Extract the sample characteristics
sample_info <- colData(rse_gene)
colnames(sample_info)
## [1] "project"
## [2] "sample"
## [3] "experiment"
## [4] "run"
## [5] "read_count_as_reported_by_sra"
## [6] "reads_downloaded"
## [7] "proportion_of_reads_reported_by_sra_downloaded"
## [8] "paired_end"
## [9] "sra_misreported_paired_end"
## [10] "mapped_read_count"
## [11] "auc"
## [12] "sharq_beta_tissue"
## [13] "sharq_beta_cell_type"
## [14] "biosample_submission_date"
## [15] "biosample_publication_date"
## [16] "biosample_update_date"
## [17] "avg_read_length"
## [18] "geo_accession"
## [19] "bigwig_file"
## [20] "title"
## [21] "characteristics"
sample_info$characteristics[1]
## CharacterList of length 1
## [[1]] cell type: lymphoblast cell line: TK6 ... time: 24 dose: 1
#some more work to to for the actual pheno information
geochar <- lapply(split(sample_info, seq_len(nrow(sample_info))), geo_characteristics)
sample_info <- do.call(rbind, geochar)
head(sample_info)
## cell.type cell.line irradiation chemical time dose
## 1 lymphoblast TK6 UVC TPA 24 1
## 2 lymphoblast TK6 UVC DMSO 4 1
## 3 lymphoblast TK6 UVC TPA 24 1
## 4 lymphoblast TK6 UVC TPA 24 1
## 5 lymphoblast TK6 UVC DMSO 24 1
## 6 lymphoblast TK6 UVC DMSO 24 1
#extracting the information about the genes
gene_info <- rowData(rse_gene)
gene_info[1:5, ]
## DataFrame with 5 rows and 3 columns
## gene_id bp_length symbol
## <character> <integer> <CharacterList>
## ENSG00000000003 ENSG00000000003.14 4535 TSPAN6
## ENSG00000000005 ENSG00000000005.5 1610 TNMD
## ENSG00000000419 ENSG00000000419.12 1207 DPM1
## ENSG00000000457 ENSG00000000457.13 6883 SCYL3
## ENSG00000000460 ENSG00000000460.16 5967 C1orf112
Let’s try now another project, SRP064264:
#project of interest
project_id <- 'SRP064264';
# Download the gene-level RangedSummarizedExperiment data
download_study(project_id)
# Load the object rse_gene
load(file.path(project_id, 'rse_gene.Rdata'))
# Extract the sample characteristics
sample_info <- colData(rse_gene)
geochar <- lapply(split(sample_info, seq_len(nrow(sample_info))), geo_characteristics)
sample_info <- do.call(rbind, geochar)
sample_info[1:10, , drop = FALSE]
## characteristics
## 1 NA
## 2 NA
## 3 NA
## 4 NA
## 5 NA
## 6 NA
## 7 NA
## 8 NA
## 9 NA
## 10 NA
What happened? For some reason, the curators were not able to process the phenotype information. We can still retrieve them from the SRA website:
Ensembl is an initiative of the European Bioinformatics Institute and the Wellcome Trust Sanger Institute. It stores, organize and provide information about human and other vertebrate species genome. www.ensembl.org
Biomart is a open source database solution that has been developed specifically for biological data, and that has been deployed for several biology oriented databases. www.biomart.org
Finally, ensembl-biomart is the biomart instantation of the information collected by Ensembl: http://www.ensembl.org/biomart/martview/
The package for accessing ensembl-biomart is biomaRt
#installing recount
if(!require('biomaRt', quietly = TRUE)){
#note: use 'http' if 'https' is not supported
source(file = "http://bioconductor.org/biocLite.R")
biocLite("biomaRt")
library("biomaRt")
}
First, we must create a connection to the database we want to use.
#what are the available repository?
listMarts()
## biomart version
## 1 ENSEMBL_MART_ENSEMBL Ensembl Genes 89
## 2 ENSEMBL_MART_MOUSE Mouse strains 89
## 3 ENSEMBL_MART_SNP Ensembl Variation 89
## 4 ENSEMBL_MART_FUNCGEN Ensembl Regulation 89
#let's choose the human repository
mart <- useMart('ENSEMBL_MART_ENSEMBL')
#what are the databases in this repository?
head(listDatasets(mart))
## dataset description
## 1 pabelii_gene_ensembl Orangutan genes (PPYG2)
## 2 mfuro_gene_ensembl Ferret genes (MusPutFur1.0)
## 3 vpacos_gene_ensembl Alpaca genes (vicPac1)
## 4 oniloticus_gene_ensembl Tilapia genes (Orenil1.0)
## 5 dnovemcinctus_gene_ensembl Armadillo genes (Dasnov3.0)
## 6 pformosa_gene_ensembl Amazon molly genes (Poecilia_formosa-5.1.2)
## version
## 1 PPYG2
## 2 MusPutFur1.0
## 3 vicPac1
## 4 Orenil1.0
## 5 Dasnov3.0
## 6 Poecilia_formosa-5.1.2
#let's choose the human one
ensembl = useDataset("hsapiens_gene_ensembl", mart = mart);
We now want to create a query in order to retrieve the affymetrix probesets measuring a given gene, IKZF1
#what are the available filter?
head(listFilters(ensembl))
## name description
## 1 chromosome_name Chromosome/scaffold name
## 2 start Start
## 3 end End
## 4 band_start Band Start
## 5 band_end Band End
## 6 marker_start Marker Start
#selecting external gene name
filterName <- 'external_gene_name';
value <- 'IKZF1'
#what are the available attributes?
head(listAttributes(ensembl))
## name description page
## 1 ensembl_gene_id Gene stable ID feature_page
## 2 ensembl_transcript_id Transcript stable ID feature_page
## 3 ensembl_peptide_id Protein stable ID feature_page
## 4 ensembl_exon_id Exon stable ID feature_page
## 5 description Gene description feature_page
## 6 chromosome_name Chromosome/scaffold name feature_page
#let's choose the human repository
mart <- useMart('ENSEMBL_MART_ENSEMBL')
#what are the databases in this repository?
head(listDatasets(mart))
## dataset description
## 1 cintestinalis_gene_ensembl C.intestinalis genes (KH)
## 2 oanatinus_gene_ensembl Platypus genes (OANA5)
## 3 mmulatta_gene_ensembl Macaque genes (Mmul_8.0.1)
## 4 etelfairi_gene_ensembl Lesser hedgehog tenrec genes (TENREC)
## 5 gmorhua_gene_ensembl Cod genes (gadMor1)
## 6 tbelangeri_gene_ensembl Tree Shrew genes (tupBel1)
## version
## 1 KH
## 2 OANA5
## 3 Mmul_8.0.1
## 4 TENREC
## 5 gadMor1
## 6 tupBel1
#choosing a few
attrs <- c("ensembl_gene_id", "external_gene_name", "affy_hg_u133_plus_2");
#let's run the query!
res <- getBM(attributes = attrs,
filters = filterName,
values = value,
mart = ensembl)
res
## ensembl_gene_id external_gene_name affy_hg_u133_plus_2
## 1 ENSG00000185811 IKZF1 220704_at
## 2 ENSG00000185811 IKZF1
## 3 ENSG00000185811 IKZF1 216901_s_at
## 4 ENSG00000185811 IKZF1 205039_s_at
## 5 ENSG00000185811 IKZF1 1565818_s_at
## 6 ENSG00000185811 IKZF1 227346_at
## 7 ENSG00000185811 IKZF1 227344_at