---
title: "User guide for executing `CytOpT` on `HIPC` data"
author: "Paul Freulon, Jérémie Bigot, Kalidou BA, Boris Hejblum"
date: "`r Sys.Date()`"
logo: "man/figures/logo.png"
output:
  rmarkdown::html_vignette:
    toc: yes
vignette: >
  %\VignetteIndexEntry{User guide for executing `CytOpT` on `HIPC` data}
  %\VignetteEngine{knitr::rmarkdown}
  %\VignetteEncoding{UTF-8}
---

```{r knitrsetup, include = FALSE}

library(reticulate)
# this vignette requires python 3.7 or newer to run
eval <- tryCatch({
  numeric_version(py_config()$version) >= "3.7" && py_numpy_available() && 
    py_module_available("scipy") && py_module_available("sklearn")
}, error = function(e) FALSE)

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = eval
)
```

# Introduction

`CytOpT` is a supervised method that directly estimates cell population proportions in a flow-cytometry data set. It takes a manually gated source data set as input and relies on regularized optimal transport to infer the proportions in a target data set.
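
To make "regularized optimal transport" slightly more concrete, here is a rough sketch of the kind of objective involved. It is loosely based on the article cited at the end of this vignette, and the notation is an assumption introduced here for illustration, not the package's exact formulation. Write $\alpha(h)$ for the source distribution re-weighted so that its class proportions equal $h$, $\beta$ for the target distribution, $c$ for a ground cost between cells, and $\varepsilon > 0$ for a regularization parameter. An entropy-regularized transport cost can then be written as

$$
W^{\varepsilon}\big(\alpha(h), \beta\big) = \min_{\pi \in \Pi(\alpha(h), \beta)} \int c(x, y) \, d\pi(x, y) + \varepsilon \, \mathrm{KL}\big(\pi \,\|\, \alpha(h) \otimes \beta\big),
$$

where $\Pi(\alpha(h), \beta)$ is the set of couplings with marginals $\alpha(h)$ and $\beta$. The estimated proportions are the weights on the probability simplex $\Sigma_K$ that bring the re-weighted source closest to the target:

$$
\hat{h} \in \operatorname*{arg\,min}_{h \in \Sigma_K} W^{\varepsilon}\big(\alpha(h), \beta\big).
$$

The Kullback-Leibler penalty is one common choice of regularizer; refer to the article for the precise definition used by `CytOpT`.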

# Analysis of HIPC data

As an illustrative example, we analyze flow-cytometry data from the T-cell panel of the Human Immunology Project Consortium (HIPC), publicly available on ImmuneSpace ([Gottardo et al., 2014](https://www.nature.com/articles/nbt.2777)).

An `HIPC` data set has the following structure (split into 2 files):

 - `xx_y_values`: flow-cytometry measurements
 - `xx_y_clust`: file with the corresponding manual clustering

Above, `xx` denotes the center where the data analysis was performed, and `y` denotes the patient and the replicate of the biological sample in question.


## Data load

```{r, eval=TRUE}
library(CytOpT)
data("HIPC_Stanford")
```

Here are the first few lines of the flow-cytometry measurements from patient *1228* replicate *1A*:
```{r}
knitr::kable(head(HIPC_Stanford_1228_1A))
```
The manual clustering of these data into 10 cell populations (`r paste0(levels(HIPC_Stanford_1228_1A_labels), collapse=", ")`) can be accessed from the `HIPC_Stanford_1228_1A_labels` object.

We will use the manual gating from patient *1228* replicate *1A* as the source to infer cell proportions for patient *1369* replicate *1A*.

## Computation of the benchmark class proportions for target data

In this example, the true proportions in the target data set `HIPC_Stanford_1369_1A` are known, so we can assess the gap between the estimates from `CytOpT` and the cellular proportions from the reference manual gating. For this purpose, we compute the manual proportions with:
```{r}
gold_standard_manual_prop <- c(table(HIPC_Stanford_1369_1A_labels)/length(HIPC_Stanford_1369_1A_labels))
```
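
As a minimal illustration of what this computation returns, the sketch below applies the same idea to a small, made-up label vector. The names `toy_labels` and `toy_prop` and the population names are hypothetical and are not part of the `HIPC` data. It uses `prop.table()`, which is equivalent to dividing the counts by their total.

```r
# Hypothetical labels for six cells drawn from three populations
toy_labels <- factor(c("CD4", "CD8", "CD4", "B", "CD4", "CD8"))

# Count cells per population, then convert counts to proportions;
# c() drops the table class and keeps a named numeric vector
toy_prop <- c(prop.table(table(toy_labels)))

print(toy_prop)
```

The result is a named vector with one entry per factor level, ordered by level, whose entries sum to 1.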

## `CytOpT`


### Optimization

```{r}
set.seed(123)
res <- CytOpT(X_s = HIPC_Stanford_1228_1A, X_t = HIPC_Stanford_1369_1A, 
              Lab_source = HIPC_Stanford_1228_1A_labels,
              theta_true = gold_standard_manual_prop,
              method='both', monitoring = TRUE)
```


### Results

The results from `CytOpT` for both optimization algorithms are:
```{r}
summary(res)
```

Some visualizations are provided by the `plot()` method:
```{r, fig.width = 7, fig.asp = .8, fig.retina=2}
plot(res)
```

### Performance evaluation

Concordance between the manual gating gold standard and the `CytOpT` estimates can 
be assessed graphically with Bland-Altman plots:

```{r BA, fig.width = 7, fig.asp = .6, fig.retina = 2}
Bland_Altman(res$proportions)
```
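
For readers unfamiliar with Bland-Altman plots, the sketch below computes the quantities such a plot usually displays, using made-up proportions. The vectors `gold` and `est` are hypothetical, and the sign convention for the difference and the 1.96 standard deviation limits of agreement are the textbook choices, which may differ from what `Bland_Altman()` draws.

```r
# Hypothetical gold-standard and estimated proportions for three populations
gold <- c(A = 0.50, B = 0.30, C = 0.20)
est  <- c(A = 0.46, B = 0.33, C = 0.21)

# A Bland-Altman plot shows, for each population, the average of the two
# measurements (x-axis) against their difference (y-axis)
ba <- data.frame(
  population = names(gold),
  average    = unname((gold + est) / 2),
  difference = unname(est - gold)
)

# Mean difference (bias) and the usual 95% limits of agreement
bias <- mean(ba$difference)
limits <- bias + c(-1.96, 1.96) * sd(ba$difference)

print(ba)
```

Points scattered close to a difference of zero, with no trend along the x-axis, indicate good agreement between the two sets of proportions.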

----

The methods implemented in the `CytOpT` package are detailed in the following article:

> Paul Freulon, Jérémie Bigot, Boris P. Hejblum. CytOpT: Optimal
> Transport with Domain Adaptation for Interpreting Flow Cytometry data
> <https://arxiv.org/abs/2006.09003>