Single-cell RNA-seq (scRNA-seq) experiments can provide a wealth of information about heterogeneous, multicellular systems. However, this information has to be inferred computationally from sequencing reads, which constitute a sparse and noisy sub-sampling of the actual cellular transcriptomes. Here, University of Washington researchers present UNCURL, a unified framework for scRNA-seq data visualization, cell type identification, and lineage estimation that explicitly accounts for the sequencing process. The main algorithmic novelty is a non-negative matrix factorization method that uses knowledge of the distribution resulting from the sequencing process to model the underlying cell state matrix more accurately. The researchers also developed a systematic way of incorporating prior biological information, such as bulk RNA expression profiles, into the cell state matrix. They report that UNCURL dramatically improves performance over state-of-the-art methods both in the absence and in the presence of prior knowledge. Finally, they demonstrate that using UNCURL as a data preprocessing tool significantly improves the performance of existing scRNA-seq analysis algorithms.
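To make the idea of a distribution-aware non-negative matrix factorization more concrete, here is a minimal sketch in Python. It is not the UNCURL implementation: it uses the generic multiplicative updates for a Poisson (KL-divergence) objective, and the names `poisson_nmf`, `poisson_nll`, and the `init_means` parameter are hypothetical. Passing bulk expression profiles as `init_means` is only one assumed way of injecting prior information, namely as a starting point for the factorization.

```python
import numpy as np


def poisson_nll(X, M, W, eps=1e-10):
    """Poisson negative log-likelihood of X given mean M @ W (constant term dropped)."""
    mean = M @ W
    return float(np.sum(mean - X * np.log(mean + eps)))


def poisson_nmf(X, k, n_iter=200, init_means=None, seed=0, eps=1e-10):
    """Factor a genes x cells count matrix X into non-negative M (genes x k)
    and W (k x cells) by minimizing the Poisson negative log-likelihood.

    init_means (hypothetical): optional genes x k matrix, e.g. bulk expression
    profiles, used as the initial value of M instead of a random start.
    """
    rng = np.random.default_rng(seed)
    n_genes, n_cells = X.shape
    if init_means is None:
        M = rng.random((n_genes, k)) + eps
    else:
        M = np.array(init_means, dtype=float) + eps
    W = rng.random((k, n_cells)) + eps

    for _ in range(n_iter):
        # Update the per-cell mixing weights while M is held fixed.
        ratio = X / (M @ W + eps)
        W *= (M.T @ ratio) / (M.sum(axis=0)[:, None] + eps)
        # Update the per-state mean expression while W is held fixed.
        ratio = X / (M @ W + eps)
        M *= (ratio @ W.T) / (W.sum(axis=1)[None, :] + eps)
    return M, W
```

In this sketch, the product `M @ W` plays the role of the estimated, un-sampled expression matrix that could be handed to downstream clustering or visualization tools.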
Learning with scRNA-Seq data using UNCURL
(A) The primary input for UNCURL is the heavily sub-sampled single-cell sequencing data. The user is also expected to specify the appropriate sampling distribution for the data and, optionally, any prior information that is known about the specific dataset. UNCURL then converts the observed sampled data to an estimate of the true data using a novel technique called Sampled Matrix Factorization. This estimate is then used in downstream unsupervised learning tasks. (B) Sampled Matrix Factorization using UNCURL. The true transcriptomic states of cells are assumed to lie along a continuum of states in a high-dimensional space. These states are sampled during the single-cell sequencing process, resulting in the transcript count matrix, which contains the observed states. UNCURL then reconstructs an estimate of the true states from the observed states by a novel algorithm for ‘Sampled Matrix Factorization’, which can be viewed as an un-sampling process. (C) Comparison of the fit error across all genes when the data are modeled with Gaussian and Poisson distributions. 96.44% of genes have a lower fit error under the Poisson distribution than under the Gaussian distribution. (D) Some of the different types of prior information supported by UNCURL, namely bulk RNA-seq data, microarray data, cell-type-specific marker information, FISH images, etc.
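The per-gene comparison in panel (C) can be illustrated with a small sketch. The source does not say how the fit error was computed, so the definition below is an assumption: for each gene, the empirical frequency of each integer count is compared with the probability a fitted Poisson or a fitted (discretized) Gaussian assigns to that count, and the absolute differences are summed. The function names are hypothetical.

```python
import math
from collections import Counter


def poisson_pmf(k, lam):
    """Probability of observing count k under a Poisson with mean lam."""
    if lam == 0:
        return 1.0 if k == 0 else 0.0
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of a Gaussian with mean mu and std sigma."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def fit_errors(counts):
    """Return (poisson_error, gaussian_error) for one gene's counts across cells.

    Assumed error definition: sum over observed integer bins of
    |empirical frequency - fitted model probability|.
    """
    n = len(counts)
    mu = sum(counts) / n
    sigma = math.sqrt(sum((c - mu) ** 2 for c in counts) / n)
    freq = Counter(counts)
    pois_err = 0.0
    gauss_err = 0.0
    for k in range(max(counts) + 1):
        emp = freq.get(k, 0) / n
        pois_err += abs(emp - poisson_pmf(k, mu))
        if sigma > 0:
            # Gaussian mass falling in the unit-width bin centered on k.
            g = normal_cdf(k + 0.5, mu, sigma) - normal_cdf(k - 0.5, mu, sigma)
        else:
            # Degenerate case: every cell has the same count.
            g = 1.0 if k == round(mu) else 0.0
        gauss_err += abs(emp - g)
    return pois_err, gauss_err


def fraction_poisson_better(matrix):
    """Fraction of genes (rows of per-gene count lists) with lower Poisson error."""
    wins = 0
    for gene_counts in matrix:
        pois_err, gauss_err = fit_errors(gene_counts)
        if pois_err < gauss_err:
            wins += 1
    return wins / len(matrix)
```

For a sparse, low-count gene (mostly zeros with a few ones and twos), the Gaussian fit places too much mass on intermediate values, so its error under this definition tends to be larger than the Poisson error.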