Scalable latent-factor models applied to single-cell RNA-seq data separate biological drivers from confounding effects

Single-cell RNA-sequencing (scRNA-seq) allows heterogeneity in gene expression levels to be studied in large populations of cells. Such heterogeneity can arise from both technical and biological factors, which makes decomposing the sources of variation extremely difficult.

Researchers from the European Bioinformatics Institute have developed a computationally efficient model that uses prior pathway annotation to guide inference of the biological drivers underpinning the heterogeneity. Moreover, they jointly update and improve gene set annotation and infer factors explaining variability that fall outside the existing annotation. The researchers validate their method using simulations, which demonstrate both its accuracy and its ability to scale to large datasets with up to 100,000 cells. Through applications to real data, they show that their model can robustly decompose scRNA-seq datasets into interpretable components and facilitate the identification of novel subpopulations.

Factorial single-cell latent variable model: approach and motivation


(a) f-scLVM decomposes the matrix of single-cell gene expression profiles into factors and weights. Gene sets from pathway databases are used to annotate a subset of factors, with the remainder allowing the existence of unannotated factors. The fitted model can be used for different downstream analyses, including i) identification of biological drivers; ii) visualization of cell states; iii) data-driven adjustment of gene sets and iv) adjustment of confounding factors. (b) Bivariate visualization of 182 mouse ES cells, experimentally staged for the cell cycle, using the G2M checkpoint and P53 pathway factors. The inferred G2M checkpoint factor discriminates cells in G2/M phase from the remaining cell population. (c) Weights for the most important genes in the P53 pathway and G2M checkpoint factors, showing both genes that were pre-annotated by MSigDB (black) and genes added by the model (red).
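The decomposition in panel (a) can be illustrated with a small sketch. This is not the f-scLVM implementation or its API: it is a plain alternating ridge-regression factorization in NumPy, in which a boolean mask plays the role of the gene set annotation and one factor with an all-True mask stands in for an unannotated factor. The function name masked_factorization, the simulated gene sets and all parameter values are assumptions made for illustration only. Unlike the model described above, the sketch keeps the annotation fixed and does not add or remove genes from a set.

```python
import numpy as np

def masked_factorization(Y, mask, n_iter=50, ridge=1e-2, seed=0):
    """Approximate Y (cells x genes) as X @ W.

    mask is a boolean array (factors x genes); W[k, g] may be non-zero
    only where mask[k, g] is True. Hypothetical helper, not part of f-scLVM.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = Y.shape
    n_factors = mask.shape[0]
    X = rng.standard_normal((n_cells, n_factors))   # factor activities per cell
    W = np.zeros((n_factors, n_genes))              # gene weights per factor

    for _ in range(n_iter):
        # Update weights gene by gene, using only the factors allowed by the mask.
        for g in range(n_genes):
            active = np.flatnonzero(mask[:, g])
            if active.size == 0:
                continue
            Xa = X[:, active]
            A = Xa.T @ Xa + ridge * np.eye(active.size)
            W[active, g] = np.linalg.solve(A, Xa.T @ Y[:, g])
        # Update factor activities for all cells at once (ridge regression on W).
        A = W @ W.T + ridge * np.eye(n_factors)
        X = np.linalg.solve(A, W @ Y.T).T
    return X, W

# Simulated example: two annotated factors and one dense, unannotated factor.
rng = np.random.default_rng(1)
n_cells, n_genes = 200, 60
mask = np.zeros((3, n_genes), dtype=bool)
mask[0, :20] = True      # hypothetical gene set A
mask[1, 15:40] = True    # hypothetical gene set B (overlaps A)
mask[2, :] = True        # unannotated factor: every gene allowed

X_true = rng.standard_normal((n_cells, 3))
W_true = rng.standard_normal((3, n_genes)) * mask
Y = X_true @ W_true + 0.1 * rng.standard_normal((n_cells, n_genes))

X_hat, W_hat = masked_factorization(Y, mask)
rel_error = np.linalg.norm(Y - X_hat @ W_hat) / np.linalg.norm(Y)
print(f"relative reconstruction error: {rel_error:.3f}")
```

In this sketch the columns of X_hat correspond to per-cell factor values of the kind plotted in panel (b), and the rows of W_hat to per-gene weights of the kind shown in panel (c).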

Availability – An open source implementation of factorial single-cell latent variable model (f-scLVM) for reviewing purposes is available at: https://github.com/PMBio/f-scLVM

Buettner F, Pratanwanich N, Marioni JC, Stegle O. (2016) Scalable latent-factor models applied to single-cell RNA-seq data separate biological drivers from confounding effects. bioRxiv [Epub ahead of print].
