--- title: "User guide for performing automatic gating with `cytometree`" author: "Chariff Alkhassim, Boris Hejblum" date: "`r Sys.Date()`" output: rmarkdown::html_vignette: toc: yes vignette: > %\VignetteIndexEntry{User guide for performing automatic gating with `cytometree`} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r knitrsetup, include=FALSE} knitr::opts_chunk$set(tidy = TRUE) knitr::knit_hooks$set(small.mar = function(before, options, envir) { if (before) par(mar = c(0, 0, 0, 0)) # no margin }) ``` ![](../man/figures/logo.png){width=139px} # Introduction to `cytometree` `cytometree` is a package that implements a binary tree algorithm for the analysis of cytometry data. Its core algorithm is based on the construction of a binary tree, whose nodes represents cell subpopulations. ## Binary tree construction 1. At each node, observed cells (or "events") and markers are modeled by both a normal distribution (so *unimodal*), and a mixture of 2 normal distributions (so *bimodal*). 2. Splitting of the events at each node is done according to a normalized difference of AIC between the two distributional fit (unimodal or bimodal), allowing to pick the marker that best splits those data. 3. When AIC normalized differences $D$ are not significant anymore, the tree has been constructed, and the cells have been automatically gated (i.e. partitioned). ## Post-hoc annotation Given the unsupervised nature of the binary tree, some of the available markers may not be used to find the different cell populations present in a given sample. To recover a complete annotation, we defined, as a post processing procedure, an annotation method which allows the user to distinguish two (or three) expression levels per marker. # Examples of analysis with `cytometree` ## Diffuse large B-cell lymphoma dataset with 3 markers In this example, we will use a diffuse large B-cell lymphoma dataset (from the [FlowCAP-I challenge](http://flowcap.flowsite.org/), with only 3 markers. ### Automatic gating step First, we need to load the package `cytometree` and we can have a look at the data: ```{r, message=FALSE} library(cytometree) data(DLBCL) dim(DLBCL) head(DLBCL) ``` We have 3 markers measured `FL1`, `FL2`, `FL4` as well as the `label` obtained from manual gating, and `r nrow(DLBCL)` cells. ```{r} # getting the only the cell events with the 3 markers measured: cellevents <- DLBCL[,c("FL1", "FL2", "FL4")] # storing the maanual gating reference from FlowCAP-I: manual_labels <- DLBCL[,"label"] ``` The function `CytomeTree` builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting `labels` attribute): ```{r, message=FALSE, results='hide'} # Build the binary tree: Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.1) # Retreive the resulting partition (i.e. automatic gating): Tree_Partition <- Tree$labels ``` One can fiddle with the `t` threshold to change the desired depth of the tree. The function `plot_graph` plots a graph representing this binary tree, showing which markers were used in which order to splits the events: ```{r, fig.width = 4, fig.height=4, small.mar=TRUE} # Plot a graph of the tree (with specific graphical parameters): plot_graph(Tree, edge.arrow.size = 0.3, Vcex = 0.8, vertex.size = 30) ``` The function `plot_nodes` allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node: ```{r, small.mar=TRUE, fig.width=6.5, fig.height=6} # Plot the distribution fit for each node: plot_nodes(Tree) ``` ```{r} # Plot the distribution fit for a particular node: plot_nodes(Tree, "FL4.1") ``` ### Post-processing annotation of automatically gated subpopulations The function `Annotation` annotates the subpopulations partitioned by the binary tree: ```{r} # Run the annotation algorithm: Annot <- Annotation(Tree,plot=FALSE) Annot$combinations ``` The post-processing annotation can be compared to the incomplete annotation obtained from the tree alone: ```{r} # Annotation gotten from the tree: Tree$annotation ``` This completed annotation eases the search for specific cell types of interest, where `1` represents a positive measurement of a marker, while `0` corresponds to its absence: ```{r} # seeked cell type: FL2+FL4+ pheno_ex1 <- RetrievePops(Annot, phenotypes = list(rbind(c("FL2", 1), c("FL4", 1)))) pheno_ex1$phenotypesinfo #One can look for several cell types at once: phenotypes <- list() # FL2+FL4- phenotypes[[1]] <- rbind(c("FL2", 1), c("FL4", 0)) # FL2-FL4+ phenotypes[[2]] <- rbind(c("FL2", 0), c("FL4", 1)) # Retreive corresponding cell populations: PhenoInfos <- RetrievePops(Annot, phenotypes) PhenoInfos$phenotypesinfo ``` ### Comparison between manual and automatic gating Finally, we can compare the automatic gating obtained using `cytometree` to the original manual gating (which had `r sum(manual_labels==0)` events inconsistently gated across different operators performing the manual gating on those very same data - so those `r sum(manual_labels==0)` cells were identified as outliers in the reference label and were kept out of the evaluation criteria in the [FlowCAP-I challenge](http://flowcap.flowsite.org/)): ```{r, echo=FALSE, small.mar=TRUE, fig.width=6.5, fig.height=4, message=FALSE} # F-measure ignoring cells labeled 0 as in FlowCAP-I. # Use FmeasureC() in any other case. Fmeas <- cytometree::FmeasureC_no0(ref=manual_labels, pred=Tree_Partition) # Scatterplots. library(ggplot2) library(viridis) # Ignoring cells labeled 0 as in FlowCAP-I. # Building the data frame to scatter plot the data. FL1 <- cellevents[, "FL1"] FL2 <- cellevents[, "FL2"] FL4 <- cellevents[, "FL4"] n <- length(FL1) man_lab <- factor(as.character(manual_labels)) levels(man_lab) <- c("outliers (manual gating)", "pop1", "pop2") FL4pos <- RetrievePops(Annot, phenotypes = list(cbind("FL4", 1))) auto_lab <- factor(as.character(FL4pos$Mergedleaves)) levels(auto_lab) <- c("pop2", "pop1") Labels <- factor(c(as.character(man_lab), as.character(auto_lab))) method <- as.factor(c(rep("FlowCap-I manual gating", n), rep("CytomeTree auto gating", n))) scatter_df <- data.frame("FL2" = FL2, "FL4" = FL4, "Label" = Labels, "method" = method) p <- ggplot2::ggplot(scatter_df, ggplot2::aes_string(x = "FL2", y="FL4",colour="Label"))+ ggplot2::geom_point(alpha = 1,cex = 1)+ ggplot2::scale_colour_manual(values = c("grey", viridis::viridis(4))) + ggplot2::facet_wrap(~ method) + ggplot2::theme_bw() + ggplot2::theme(legend.position="bottom") + ggplot2::guides(colour = ggplot2::guide_legend(override.aes = list(size=3)))+ ggplot2::ggtitle(paste("F-measure (manual gating as reference - removing the outliers) =", round(Fmeas, 3))) p ``` ## PBMC sample from the Human Immunology Project In this example, we will use a dataset from the HIPC ([Human Immunology Project Consortium](https://www.immuneprofiling.org/) program where a PBMC sample from patient 12828 was analyzed by Stanford. ### Automatic gating step First, we need to load the package `cytometree` and we can have a look at the data: ```{r, message=FALSE} data(HIPC) dim(HIPC) head(HIPC) ``` We have 6 markers measured ` CCR7`, `CD4`, `CD45RA`, `HLADR`, `CD38`, `CD8` as well as the `label` obtained from manual gating, and `r nrow(HIPC)` cell. ```{r} # getting the only the cell events with the 3 markers measured: cellevents <- HIPC[,c("CCR7", "CD4", "CD45RA", "HLADR" ,"CD38" ,"CD8")] # storing the maanual gating reference from FlowCAP-I: manual_labels <- HIPC[,"label"] ``` The function `CytomeTree` builds the binary tree according to our algorithm, that gives the automatic gating (given in the resulting `labels` attribute): ```{r, message=FALSE, results='hide'} # Build the binary tree: Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.2) # Retreive the resulting partition (i.e. automatic gating): Tree_Partition <- Tree$labels ``` One can fiddle with the `t` threshold to change the desired depth of the tree. The function `plot_graph` plots a graph representing this binary tree, showing which markers were used in which order to splits the events: ```{r, fig.width = 6.2, fig.height = 6.2, small.mar = TRUE} # Plot a graph of the tree (with specific graphical parameters): plot_graph(Tree, edge.arrow.size = 0.4, Vcex = 0.45) ``` The function `plot_nodes` allows to graphically evaluate the fit of the gaussian family distribution at each node. It can also be used to investigate a particular node: ```{r, echo=FALSE, small.mar=TRUE, fig.width=6.5, fig.height=4, message=FALSE} # Plot the fit for 2 specific nodes: plot_nodes(Tree, list("CD4.1", "CD45RA.7")) ``` ### Post-processing annotation of automatically gated subpopulations The function `Annotation` annotates the subpopulations partitioned by the binary tree: ```{r} # Run the annotation algorithm: Annot <- Annotation(Tree,plot=FALSE) Annot$combinations ``` This completed annotation eases the search for specific cell types of interest, where `1` represents a positive measurement of a marker, while `0` corresponds to its absence: ```{r, fig.width=5, fig.height=5} #One can look for several cell types at once: phenotypes <- list() # CD8-CD4+CCR7-CD45RA+ CD4effector <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 0), c("CD45RA", 1)) phenotypes[[1]] <- CD4effector # CD8-CD4+CCR7+CD45RA+ CD4naive <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 1), c("CD45RA", 1)) phenotypes[[2]] <- CD4naive # CD8-CD4+CCR7+CD45RA- CD4CM <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 1), c("CD45RA", 0)) phenotypes[[3]] <- CD4CM # CD8-CD4+CCR7+CD45RA+ CD4effectorM <- rbind(c("CD8", 0), c("CD4", 1), c("CCR7", 0), c("CD45RA", 0)) phenotypes[[4]] <- CD4effectorM # Retreive corresponding cell populations: PhenoInfos <- RetrievePops(Annot, phenotypes) # One can find informations about the seeked phenotypes in PhenoInfos$phenotypesinfo # According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7-CD45RA+ population is labeled 12 # According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA+ population is labeled 5 # According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA- population is labeled 4 # According to PhenoInfos$phenotypesinfo, CD8-CD4+CCR7+CD45RA+ population is comprised # of 4 clusters labeled 8,9,10, 11. They were merged into a new cluster labeled 13. # The new partition, with merged clusters is to be found in PhenoInfos$Mergedleaves. auto_labels_merged <- PhenoInfos$Mergedleaves # Get the indices of the subpopulation comprised of classes labeled 12, 5, 4,13. subpopulation_indices <- auto_labels_merged %in% c(12, 5, 4, 13) # Scatter plot the data. CD45RA <- cellevents[subpopulation_indices, "CD45RA"] CCR7 <- cellevents[subpopulation_indices, "CCR7"] auto_lab <- factor(auto_labels_merged[subpopulation_indices]) levels(auto_lab) <- viridis::viridis(length(levels(auto_lab))) auto_lab <- as.character(auto_lab) plot(CD45RA, CCR7, col = auto_lab, main = "Automatic gating") ``` ## Semi-supervised gating It is possible to force the markers which will be used to gate the data first, using the `force_first_markers` argument of the `CytomeTree()` function. Once those forced markers are used, automatic behavior of the algorithm is resumed for the remaining markers. ```{r, message=FALSE, results='hide'} # Build the binary tree, forcing the first node to be CD4, and the second ones HLADR Tree <- CytomeTree(cellevents, minleaf = 1, t = 0.2, force_first_markers = c("CD4", "HLADR")) ``` ```{r, fig.width = 6.2, fig.height = 6.2, small.mar = TRUE} # Plot a graph of the tree (with specific graphical parameters): plot_graph(Tree, edge.arrow.size = 0.4, Vcex = 0.45) ```