--- title: "User guide to the `TcGSA` R package" author: "Anthony Devaux, Boris Hejblum" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: number_sections: yes toc: yes vignette: > %\VignetteIndexEntry{TcGSA_userguide} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r pre, echo=FALSE, warning=FALSE, include=FALSE} library(knitr) is_check <- ("CheckExEnv" %in% search()) || any(c("_R_CHECK_TIMINGS_", "_R_CHECK_LICENSE_") %in% names(Sys.getenv())) knitr::opts_chunk$set(echo = TRUE, eval=!is_check, cache=TRUE, collapse = TRUE, comment = "#>" ) #rmarkdown::render("vignettes/TcGSA_userguide.Rmd") ``` # Overview of TcGSA The `TcGSA` (Time-course Gene Set Analysis *Hejblum et al., 2015*) `R` package tests gene expression dynamics for significance in gene sets. A gene set is a group of genes, known *a priori* to share a common biological function or to be co-expressed. `TcGSA` relies on linear mixed model to take into account the potential heterogeneity of expression within a gene set. For more details, check the [published article in PLOS Computational Biology](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004310). # Getting started using TcGSA 3 inputs are required to run `TcGSA`: * The gene set object * The gene expression matrix * The design data matrix ## Gene set object A gene set is a group of genes either sharing the same biological function or . It enables to detect different gene expression and seems to be more powerful than a gene-by-gene analysis. Several definitions of groups have been made, in particular here we will focus on the following: - Chaussabel's modules (*Chaussabel et al., 2008*) - Gene Ontology database (*Ashburner et al., 2000*) - Kyoto Encyclopedia of Genes and Genomes database (*Kanehisa et Goto, 2000*) The gene set object is a `gmt` format containing: - the name of the gene set - the description of the gene set, such as biological function - the list of probes representing the genes inside each gene set One can either use already existing `gmt` objects, or build their own. ### Download gmts from *Hejblum et al., 2015* To import the `gmt`s used in *Hejblum et al., 2015*, one can download the supplementary file by running the following command: ```{r GS_import, message=FALSE} temp <- tempfile() utils::download.file("http://doi.org/10.1371/journal.pcbi.1004310.s007", destfile = temp, mode = "wb") load(unz(temp, "ReproducibleRFiles/GMTs_PLOScb.RData", open = "r")) unlink(temp) rm(temp) ``` It contains the 3 gene sets detailed above (for GO, it is only a subset of mutually exclusive gene sets with biological functions related to the immune system). *Disclaimer*: be careful with the version of the gene set databases because they are probably outdated by now. To make sure to have the latest version of the database, you can (re-)build the `gmt` object yourself following the method below. ### Self-built gmt To self-build your `gmt` object, you have to prepare a `.gmt` file. This file format is the tab delimited file which can be created with [this helpful website](https://software.broadinstitute.org/cancer/software/gsea/wiki/index.php/Data_formats#GMT:_Gene_Matrix_Transposed_file_format_.28.2A.gmt.29/) from the Broad Institute. In this file, one row represents one gene set with: - Column 1: name of the gene set - Column 2: description of the gene set - Remaining columns: list of genes included in the gene set (represented by the probes) Next, to import the `.gmt` file into R, you need to run the `GSA.read.gmt()` function from the `GSA` package. More details on the `GSA` help package. ## Gene expression matrix This matrix contains the gene expression (in cells) for each gene (in rows) of each sample (in columns) gathered from microarray measurements. The gene expression should **already** be **normalized before using `TcGSA`**. In the `rownames`, the name of each probe/gene must match with the name of probes/genes in the `gmt` object. ## Design data matrix The design data matrix contains for each sample (in row), several variables (in column). The variables required for the matrix are: * Sample names * Patient identifiers * Time measurements * _In case of multiple treatments_, the name of treatment Name of samples should be unique and match with the samples of gene expression matrix with the same order. # How to use TcGSA for one treatment group ? ## Data preparation This example comes from *Hejblum et al., 2015* and DALIA-1 HIV therapeutic vaccine trial. The aim of this study is to evaluate the immune response to HIV vaccine. To conduct this study, 19 patients contaminated by the HIV have been followed for 48 weeks split into 2 phases of 24 each: * 1^st^ phase from week 0 to week 24 (before treatment interruption): the patients were under anti-retroviral treatment and received trial vaccine on week 0, 4, 8 and 12. * 2nd phase from week 24 to week 48 (after treatment interruption): for the follow-up of patients, none of trial vaccine has been injected and anti-retroviral treatments have been interrupted (except for health problems). Blood samples have been collected at each of the different measurement time points, for each subject, to study the dynamic of gene expression over time. For more details, check the article from Hejblum et al. [here](https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1004310). The data are publicly available on [GEO website](https://www.ncbi.nlm.nih.gov/geo/) with GEO access number 'GSE46734'. We will be using the `GEOquery` package to get the data files from GEO website (see appendix for more details on `GEOquery`) ### Import of data files In this example, we need import the supplementary files available on GEOwith the `getGEOSuppFiles` function (we only need the three following files, hence the regular expression filter: gene expression pre-ATI, gene expression post-ATI, experimental design): ```{r GEOquery, include=FALSE, message=FALSE, cache=FALSE} if (!requireNamespace("GEOquery", quietly = TRUE)) { if (!requireNamespace("BiocManager", quietly = TRUE)){ install.packages("BiocManager") } BiocManager::install("GEOquery") } ``` ```{r import_dalia, message=FALSE, results='hide'} GEOquery::getGEOSuppFiles('GSE46734', filter_regex="(*NonParamCombat*)|(*DESIGN*)") ``` ### Design data matrix The design data matrix (called `design_preATI`) is extracted from one of the GEO supplementary files. It contains the needed experimental variables, plus some additional information regarding this study. Some data processing is performed made according to the source of this paper. ```{r design_dalia} design <- read.delim(gzfile("GSE46734/GSE46734_DALIA1longitudinalTranscriptome_DESIGN_anonym.txt.gz")) design_preATI <- design[-which(design$TimePoint<0 | design$TimePoint==16 | design$TimePoint>22), ] head(design_preATI,5) ``` ```{r, include=FALSE} stopifnot(nrow(design_preATI)==90 & ncol(design_preATI)==6) ``` This data frame contains 90 samples and 6 experimental variables : - `Sample_name` for the name of samples - `Patient_ID` for the identification of patients - `TimePoint` for the time measurements - `Chip_ID` `HYB_Chamber` `HYB_Day` are the variables not required for TcGSA commands ### Gene expression matrix The gene expression matrix (called `expr_preATI`) is extracted from one of the GEO supplementary files (namely the "GSE46734_DALIA1longitudinalTranscriptome_PALO01_PreATI_NEQC_NonParamCombat.txt.gz" file). NB: The data is already normalized. ```{r expr_dalia, cache=TRUE} expr_preATI <- read.delim(gzfile("GSE46734/GSE46734_DALIA1longitudinalTranscriptome_PALO01_PreATI_NEQC_NonParamCombat.txt.gz")) rownames(expr_preATI) <- expr_preATI$PROBE_ID expr_preATI <- expr_preATI[,as.character(design_preATI$Sample_name)] expr_preATI[1:4,1:4] ``` We have: * in row `ILMN_xxxxxxx` for each probe identifier * in column `Xxxxxxxxxxx_X` for the name of each sample * each cell contains the normalized gene expression ```{r, include=FALSE} stopifnot(nrow(expr_preATI)==32978 & ncol(expr_preATI)==90) ``` The entire matrix contains $32,978$ genes and $90$ samples (number of samples should be the same as in design data matrix) ```{r dim_expr_DALIA} identical(ncol(expr_preATI), nrow(design_preATI)) ``` ## Likelihood ratios test This function provides the result of likelihood ratio test using the linear mixed model for each gene set. For this example, we use gene sets data from Chaussabel's modules (*Chaussabel et al., 2008*). `TcGSA.LR` function requires: * `expr`: name of the gene expression matrix * `gmt`: name of the gmt gene set object * `design`: name of the design data matrix * `subject_name`: name of the identification of patients in the design data matrix * `time_name`: name of the time measurements in the design data matrix ```{r LR_ST, message=FALSE, warning=FALSE} tcgsa_result <- TcGSA::TcGSA.LR(expr = expr_preATI, gmt = gmt_modulesV2, design = design_preATI, subject_name = "Patient_ID", time_name = "TimePoint") ``` ```{r tcgsa_result, echo=FALSE} tcgsa_result ``` Now `tcgsa_result` is a `tcgsa` object containing, in addition to the likelihood ratio test results: - the form of tested time trends (default option is `linear`) - the number of treatment groups and the number of gene sets tested (depending on which gene set base is defined) - a few other information To get the number of significant gene sets, one can use `summary` function on a `tcgsa` object: ```{r summary_dalia} summary(tcgsa_result) ``` To get more details on the significant gene sets, use the `signifLRT.TcGSA()` function. It returns information such as the significant gene sets among all the gene sets tested, along their p-values with adjustment for multiple testing (default option is `BY` for Benjamini-Yekutieli correction *Benjamini et Yekutieli, 2001* and 5% threshold). Below is an example of five significant gene sets: ```{r signifLRT_ST} head(TcGSA::signifLRT.TcGSA(tcgsa_result)$mixedLRTadjRes) ``` You can also use the `multtest.TcGSA` function to provide the likelihood ratios, the raw and adjusted p-values for the whole gene sets with 5% threshold. Below is an example displaying only for five results: ```{r multtest_ST} head(TcGSA::multtest.TcGSA(tcgsa_result)) ``` `CVG_H0` and `CVG_H1` are the convergence of the model under null and alternative hypotheses. `0` indicates a good convergence of the model (based on `lme4` output). ## Graphical outputs `plot1GS()` plots the different representations of gene expression in a specific gene set of interest (specified by the `geneset.name` argument). This function requires the following: - `expr`: either the name of the gene expression matrix or the estimations of linear mixed model (in this example, we used the raw data from the gene expression matrix) - `gmt`: the name of the gmt gene set object - `Subject_ID`: the name of the identification of patients in the design data matrix - `TimePoint`: the name of the time measurements of the design data matrix - `geneset.name`: the name of gene set (significant ones can be found with `signifLRT.TcGSA(tcgsa_result)$mixedLRTadjRes`) - `time_unit`: string to be displayed before to the values of `TimePoint` on the x-axis (such as 'D' for 'days' for instance - **optional**) ```{r plot1GS_ST, message=FALSE, warning=FALSE, fig.keep='all'} TcGSA::plot1GS(expr = expr_preATI, #plot1GS(expr = tcgsa_result$Estimations, gmt = gmt_modulesV2, Subject_ID = design_preATI$Patient_ID, TimePoint = design_preATI$TimePoint, clustering = FALSE, time_unit = "W", geneset.name = "M3.2", title="", margins=0.4, lab.cex=0.37, axis.cex=0.37, line.size=0.45, gg.add=list(ggplot2::theme(legend.position="none"), ggplot2::ylim(-1.26,1.26) )) ``` Dotted line shows the median gene expression across subjects, in the gene set over time. # How to use TcGSA for several treatment group ? Here we are going to take another example, from *Obermoser et al. 2013*, to study the responses to influenza and pneumococcal vaccines on healthy individuals using longitudinal gene expression. The subjects are split into three groups of 6 individuals, each receiving either 2009-2010 seasonal influenza vaccine (Fluzone), a 23-valent pneumococcal vaccine (Pneumovax23), or a placebo (saline solution). Blood samples have been acquired on day -7, 0, 1, 3, 7, 10, 14, 21 and 28 to study gene expression over time. For more details, check the article from *Obermoser et al.* [here](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3492754/). The data is available on [GEO website](https://www.ncbi.nlm.nih.gov/geo/) under the GEO access number 'GSE30101', which we will be accessing through the `GEOquery` package (see appendix for more details on `GEOquery`) ## Data preparation Here, we download the data files and import them with `getGEO()` function: ```{r import_ober, message=FALSE, warning=FALSE} utils::download.file("ftp://ftp.ncbi.nlm.nih.gov/geo/series/GSE30nnn/GSE30101/soft/GSE30101_family.soft.gz", destfile = "GSE30101_family.soft.gz", mode = "wb", cacheOK = FALSE) gse.soft <- GEOquery::getGEO(filename="GSE30101_family.soft.gz") ``` Additional processing is needed to shape our dataset into a gene expression matrix: ```{r expr_ober} probesIDs <- GEOquery::Table(GEOquery::GSMList(gse.soft)[[1]])$ID data.matrix <- do.call('cbind', lapply(GEOquery::GSMList(gse.soft), function(x) { tab <- GEOquery::Table(x) mymatch <- match(probesIDs,tab$ID_REF) return(tab$VALUE[mymatch]) })) rownames(data.matrix) <- probesIDs expr.All.ChaussVac <- apply(X = data.matrix, MARGIN = 2, FUN = as.numeric) rownames(expr.All.ChaussVac) <- probesIDs ``` The experimental design data matrix can be extracted with the following commands: ```{r design_ober} design_list <- lapply(GEOquery::GSMList(gse.soft), function(x){GEOquery::Meta(x)$characteristics_ch1}) design <- data.frame(row.names = names(design_list)) design$sample_ID <- names(design_list) s_id <- unlist(lapply(design_list, function(x){gsub("subject id: ", "", x[grep("subject id: ", x)])})) design$Subject_ID <- as.character(paste("P", s_id[design$sample_ID], sep="")) time <- unlist(lapply(design_list, function(x){gsub("day: ", "", x[grep("day: ", x)])})) time[which(time %in% c("-7", "0.5", "1", "7", "10", "14", "21", "28"))] <- paste("D", time[which(time %in% c("-7", "0.5", "1", "7", "10", "14", "21", "28"))], sep="") time[which(time %in% c("-168", "1.5", "6", "9", "12", "15", "24", "36", "48"))] <- paste("H", time[which(time %in% c("-168", "1.5", "6", "9", "12", "15", "24", "36", "48"))], sep="") design$Time <- as.character(time[design$sample_ID]) vac <- unlist(lapply(design_list, function(x){ gsub("vaccine: ", "", x[grep("vaccine: ", x)]) })) vac <- as.factor(vac) levels(vac) <- c("influenza", "influenza", "influenza", "influenza", "saline", "pneumo", "pneumo", "pneumo", "saline", "saline") design$Vaccine <- as.character(vac[design$sample_ID]) sampSet <- unlist(lapply(design_list, function(x){ gsub("sample set: ", "", x[grep("sample set: ", x)]) })) design$sampSet <- as.character(sampSet[design$sample_ID]) design$Time[which(design$sampSet=="Training_Set_Vein" & design$Time %in% c("0", "3"))] <- paste("D", design$Time[which(design$sampSet=="Training_Set_Vein" & design$Time %in% c("0", "3"))], sep="") design$Time[which(design$sampSet=="Training_Set_Finger" & design$Time %in% c("0", "3"))] <- paste("H", design$Time[which(design$sampSet=="Training_Set_Finger" & design$Time %in% c("0", "3"))], sep="") design$Time[which(design$sampSet=="Test_Set_Vein" & design$Time %in% c("0", "3"))] <- paste("D", design$Time[which(design$sampSet=="Test_Set_Vein" & design$Time %in% c("0", "3"))], sep="") design$Time[which(design$sampSet=="Test_Set_Finger" & design$Time %in% c("0", "3"))] <- paste("D", design$Time[which(design$sampSet=="Test_Set_Finger" & design$Time %in% c("0", "3"))], sep="") design$Time[which(design$sampSet=="Validation_Vein" & design$Time %in% c("0", "3"))] <- paste("D", design$Time[which(design$sampSet=="Validation_Vein" & design$Time %in% c("0", "3"))], sep="") design$Day <- gsub("D", "", design$Time) design$Day[grep("H", design$Day)] <- as.numeric(gsub("H", "", design$Day[grep("H", design$Day)]))/24 design$Day <- as.numeric(design$Day) design.All.ChaussVac <- design # Avg Baseline ----- design.All.ChaussVac.trainSetVein <- design.All.ChaussVac[which(design.All.ChaussVac$sampSet=="Training_Set_Vein"),] samplesSaline2rmv <- design.All.ChaussVac.trainSetVein[162:214,"sample_ID"] design.All.ChaussVac.noDup <- design.All.ChaussVac.trainSetVein[-which(design.All.ChaussVac.trainSetVein$sample_ID%in%samplesSaline2rmv),] design.All.ChaussVac.AvgBl <- design.All.ChaussVac.noDup[which(design.All.ChaussVac.noDup$Day!=0),] design.All.ChaussVac.AvgBl[which(design.All.ChaussVac.AvgBl$Day==-7),"Day"] <- 0 design.All.ChaussVac.AvgBl[which(design.All.ChaussVac.AvgBl$Time=="D-7"),"Time"] <- "D0" expr.All.ChaussVac.AvgBl <- expr.All.ChaussVac[, design.All.ChaussVac.AvgBl$sample_ID] for(p in unique(design.All.ChaussVac.AvgBl$Subject_ID)){ if(length(which(design.All.ChaussVac.noDup$Subject_ID==p & (design.All.ChaussVac.noDup$Day==0 | design.All.ChaussVac.noDup$Day==-7)))>1){ expr.All.ChaussVac.AvgBl[, which(design.All.ChaussVac.AvgBl$Subject_ID==p & design.All.ChaussVac.AvgBl$Day==0)] <- apply(X=cbind(expr.All.ChaussVac[, design.All.ChaussVac.noDup[which(design.All.ChaussVac.noDup$Subject_ID==p & design.All.ChaussVac.noDup$Day==0), "sample_ID"]], expr.All.ChaussVac[, design.All.ChaussVac.noDup[which(design.All.ChaussVac.noDup$Subject_ID==p & design.All.ChaussVac.noDup$Day==-7), "sample_ID"]]), MARGIN=1, FUN=mean, na.rm=TRUE) } } rownames(expr.All.ChaussVac.AvgBl) <- probesIDs if(!all.equal(as.character(design.All.ChaussVac.AvgBl$sample_ID), colnames(expr.All.ChaussVac.AvgBl))){stop("\n\n\nWARNING: EXPRESSION FILE ORDER NOT MATCHING DESIGN FILE\n\n\n")} design.All.ChaussVac.AvgBl$Subject_ID <- as.factor(design.All.ChaussVac.AvgBl$Subject_ID) design.PNEUMOvsSALINE.ChaussVac.AvgBl <- design.All.ChaussVac.AvgBl[which(design.All.ChaussVac.AvgBl$Vaccine!="influenza"), ] design.PNEUMOvsSALINE.ChaussVac.AvgBl$Vaccine <- as.factor(as.character(design.PNEUMOvsSALINE.ChaussVac.AvgBl$Vaccine)) expr.PNEUMOvsSALINE.ChaussVac.AvgBl <- expr.All.ChaussVac.AvgBl[,design.PNEUMOvsSALINE.ChaussVac.AvgBl$sample_ID] ``` ## Likelihood ratio tests This function provides the result of likelihood ratio test using the linear mixed model for each gene set. For this example, we use gene sets data from Chaussabel's modules (*Chaussabel et al., 2008*). `TcGSA.LR` function requires: - `expr`: the gene expression matrix - `gmt`: the gmt gene set object - `design`: the design data matrix - `subject_name`: the identification of patients in the design data matrix - `time_name`: the time measurements in the design data matrix - `group_name`: the group of treatment in the design data matrix ```{r LR_MT, message=FALSE, warning=FALSE} tcgsa_result_MT <- TcGSA::TcGSA.LR(expr = expr.PNEUMOvsSALINE.ChaussVac.AvgBl, gmt = gmt_modulesV2, design = design.PNEUMOvsSALINE.ChaussVac.AvgBl, subject_name = "Subject_ID", time_name = "Day", group_name = "Vaccine") summary(tcgsa_result_MT) head(TcGSA::signifLRT.TcGSA(tcgsa_result_MT)$mixedLRTadjRes) ``` ## Graphical outputs for significant gene sets ### Make clusters from tcgsa object `clustTrend` builds clusters of genes from their trends dynamics. `clustTrend` function requires: * `tcgs`: your TcGSA object * `expr`: estimation of gene expressions with linear mixed model from TcGSA object * `Subject_ID`: name of the identification of patients in the design data matrix * `TimePoint`: name of the time measurements in the design data matrix * `baseline` (_optional_): value of `TimePoint` used to be the reference * `group_of_interest`: name of a treatment in the design data matrix ```{r clust_MT, message=FALSE, warning=FALSE} clust <- TcGSA::clustTrend(tcgs = tcgsa_result_MT, expr=tcgsa_result_MT$Estimations, Subject_ID=design.PNEUMOvsSALINE.ChaussVac.AvgBl$Patient_ID, TimePoint=design.PNEUMOvsSALINE.ChaussVac.AvgBl$Day, group.var = design.PNEUMOvsSALINE.ChaussVac.AvgBl$Vaccine, group_of_interest="pneumo", ref="saline") clust ``` `clust` shows the number of trends within the significant gene sets. ### Heatmap of significant gene sets `plot` draws different kinds of graphics, but we focus on heatmap graphics. This function requires: * `x`: a `tcgsa` object * `expr`: estimation of gene expressions with linear mixed model from a `tcgsa` object * `Subject_ID`: name of the subject identifier variable in the design data matrix * `TimePoint`: name of the time measurement variable in the design data matrix * `group_of_interest`: name of the treatment factor variable in the design data matrix * `clust_trends`: cluster object with the clusters of genes from their trends dynamics ```{r heatmap_MT, message=FALSE, results='asis'} plot(x=tcgsa_result_MT, expr=tcgsa_result_MT$Estimations, Subject_ID=design.PNEUMOvsSALINE.ChaussVac.AvgBl$Patient_ID, TimePoint=design.PNEUMOvsSALINE.ChaussVac.AvgBl$TimePoint, group_of_interest="pneumo", clust_trends=clust, legend.breaks=seq(from=-2,to=2, by=0.01), time_unit="D", subtitle="Pneumo vs Saline", cex.label.row=0.5, cex.label.col=1, cex.main=0.7, heatmap.width=0.2, dendrogram.size=0.3, margins=c(2,3), heatKey.size=0.8) ``` The heatmap shows an under (blue color) or an over (red color) expression for each significant gene sets in the `pneumo` arm (vaccine) compared to the `saline` arm (compared) from the `clust` object. Similar expression dynamics are clustered through a hierarchical clustering showed through a dendrogram. **Note:** this figure is different than the one in Hejblum *et al.* because here we used a linear time function (for the sake of simplicity and computational speed). To reproduce the heatmap from the original article, one must use the `time_func` argument to specify a quadratic time function with an offset at Day 1. # References Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, et al., (2000) Gene Ontology: tool for the unification of biology. _Nat Genet_ **25**(1):25-9. Benjamini Y, Yekutieli D, (2001) The Control of the False Discovery Rate in Multiple Testing under Dependency. _Ann Stat_ **29**(4):1165-88. Chaussabel D, Quinn C, Shen J, Patel P, Glaser C, Baldwin N, et al., (2008) A Modular Analysis Framework for Blood Genomics Studies: Application to Systemic Lupus Erythematosus. _Immunity_ **29**(1):150-64. Hejblum BP, Skinner J, Thiebaut R, (2015) Time-Course Gene Set Analysis for Longitudinal Gene Expression Data. _PLOS Comput Biol_ **11**(6):e1004310. Kanehisa M, Goto S, (2000) KEGG: Kyoto Encyclopedia of Genes and Genomes. _Nucleic Acids Res_ **28**(1):27-30. Obermoser G, Presnell S, Domico K, Xu H, Wang Y, Anguiano E, *et al.*, (2013) Systems Scale Interactive Exploration Reveals Quantitative and Qualitative Differences in Response to Influenza and Pneumococcal Vaccines. _Immunity_ **38**(4):831-44. # Appendix ## GEOquery package In case the data you want to analyze is publicly available through [Gene Expression Omnibus (GEO)](https://www.ncbi.nlm.nih.gov/geo/), you can access it with the `GEOquery` package, that can be installed with the following commands: ```{r dl_GEOquery, warning=FALSE, message=FALSE, eval=FALSE} if (!requireNamespace("GEOquery", quietly = TRUE)) { if (!requireNamespace("BiocManager", quietly = TRUE)){ install.packages("BiocManager") } BiocManager::install("GEOquery") } ``` More details can be found on [Bioconductor](https://bioconductor.org/packages/release/bioc/html/GEOquery.html) and in Davis S, Meltzer P, (2007) GEOquery: a bridge between the Gene Expression Omnibus (GEO) and Bioconductor _Bioinformatics_ 14:1846-1847.