The development of single cell transcriptome sequencing has allowed researchers the possibility to dig inside the role of the individual cell types in a plethora of disease scenarios. It also expands to the whole transcriptome what before was only possible for a few tenths of antibodies in cell population analysis. More importantly, it allows resolving the permanent question of whether the changes observed in a particular bulk experiment are a consequence of changes in cell type proportions or an aberrant behavior of a particular cell type. However, single cell experiments are still complex to perform and expensive to sequence making bulk RNA-Seq experiments yet more common. scRNA-Seq data is proving highly relevant information for the characterization of the immune cell repertoire in different diseases ranging from cancer to atherosclerosis. In particular, as scRNA-Seq becomes more widely used, new types of immune cell populations emerge and their role in the genesis and evolution of the disease opens new avenues for personalized immune therapies. Immunotherapy have already proven successful in a variety of tumors such as breast, colon and melanoma and its value in other types of disease is being currently explored. From a statistical perspective, single-cell data are particularly interesting due to its high dimensionality, overcoming the limitations of the “skinny matrix” that traditional bulk RNA-Seq experiments yield. With the technological advances that enable sequencing hundreds of thousands of cells, scRNA-Seq data have become especially suitable for the application of Machine Learning algorithms such as Deep Learning (DL).
Researchers from the Spanish National Cardiovascular Research Centre have developed a DL based method to enumerate and quantify the immune infiltration in colorectal and breast cancer bulk RNA-Seq samples starting from scRNA-Seq. This method makes use of a Deep Neural Network (DNN) model that allows quantification not only of lymphocytes as a general population but also of specific CD8+, CD4Tmem, CD4Th and CD4Tregs subpopulations, as well as B-cells and Stromal content. Moreover, the signatures are built from scRNA-Seq data from the tumor, preserving the specific characteristics of the tumor microenvironment as opposite to other approaches in which cells were isolated from blood. The method was applied to synthetic bulk RNA-Seq and to samples from the TCGA project yielding very accurate results in terms of quantification and survival prediction.
Scheme of the DigitalDLSorter Pipeline
1) The pipeline takes a matrix of sc-RNASeq gene expression profiles (SC profiles) and a phenotype file indicating the cell type of each cell. 2) In those experiments were the number of SC profiles for each cell type is low, new SC profiles are simulated based on the real data using ZinbWave framework. 3) Real and simulated SC profiles are split into training (65%) and test (35%) sets. 4) For each training and test set, bulk samples are created by mixing 100 single cell profiles sampled according to the bulk cell type proportions simulated for each set. 5) A DNN is trained with the training set of bulk profiles together with the corresponding SC profiles. 6) The model obtained is applied to the test set, bulk and SC profiles.
Availability – The code to run the pipeline can be obtained at https://github.com/cartof/digitalDLSorter.