Current approaches to single-cell transcriptomic analysis are computationally intensive and require assay-specific modeling, which limits their scope and generality.
Researchers from UC Berkeley and Stanford University have developed a novel method that departs from standard analysis pipelines, comparing and clustering cells based not on their transcript or gene quantifications but on their transcript-compatibility read counts. In a re-analysis of two landmark yet disparate single-cell RNA-Seq datasets, this method is up to two orders of magnitude faster than previous approaches, provides accurate and in some cases improved results, and is directly applicable to data from a wide variety of assays.
Overview of the Method. This figure illustrates our transcript-compatibility counts (TCC) clustering method in a simple yet instructive example and highlights its major differences from the conventional single-cell clustering approach. Here, we consider an scRNA-Seq example with K cells (only the reads coming from Cell1 and Cell2 are shown) and a reference transcriptome consisting of three transcripts, t1, t2 and t3.
Conventional approach: Single cells are clustered based on their transcript or gene abundances (here we focus only on transcripts for concreteness). This widely adopted pipeline involves computing a (#transcripts x #cells) expression matrix by first aligning each cell’s reads to the reference. The corresponding alignment information is shown next to each read; for the purpose of illustration, it contains only the mapped positions (the aligned reads of Cell1 are also annotated directly on the transcripts). While reads 1 and 5 are uniquely mapped to transcripts 1 and 3, respectively, reads 2, 3 and 4 are mapped to multiple transcripts (multi-mapped reads). The quantification step must therefore take into account a specific read-generating model and handle multi-mapped reads accordingly.
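To make the multi-mapping issue concrete, here is a minimal Python sketch of how a quantification step might distribute multi-mapped reads with a simple expectation-maximization (EM) loop. It is not the pipeline used by any particular tool. The compatibility sets for reads 2, 3 and 4 are made up for illustration (the text says only that they map to multiple transcripts), and the read-generating model is deliberately simplistic: all transcripts are assumed to have equal effective length.

```python
# Hypothetical compatibility sets for the five reads of Cell1.
# Reads 1 and 5 map uniquely (as in the text); the sets for reads 2-4
# are assumptions chosen only to illustrate multi-mapping.
CELL1_READS = [
    ("t1",),             # read 1: unique
    ("t1", "t2"),        # read 2: multi-mapped (assumed)
    ("t1", "t2", "t3"),  # read 3: multi-mapped (assumed)
    ("t2", "t3"),        # read 4: multi-mapped (assumed)
    ("t3",),             # read 5: unique
]
TRANSCRIPTS = ["t1", "t2", "t3"]


def em_abundances(read_compat, transcripts, n_iter=100):
    """Estimate relative transcript abundances with a basic EM loop.

    Simplifying assumption: every transcript has the same effective
    length, so a read's probability of coming from a compatible
    transcript is proportional to that transcript's abundance alone.
    """
    # Start from a uniform abundance estimate.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    for _ in range(n_iter):
        # E-step: split each read fractionally among its compatible
        # transcripts, in proportion to the current abundances.
        expected = {t: 0.0 for t in transcripts}
        for compat in read_compat:
            total = sum(theta[t] for t in compat)
            for t in compat:
                expected[t] += theta[t] / total
        # M-step: renormalize the fractional counts into abundances.
        n_reads = len(read_compat)
        theta = {t: expected[t] / n_reads for t in transcripts}
    return theta


abundances = em_abundances(CELL1_READS, TRANSCRIPTS)
```

Repeating this for every cell would fill one column of the (#transcripts x #cells) expression matrix. The iterative loop and the explicit model assumption are the parts of the conventional pipeline that a count-based alternative avoids.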
The proposed method: Single cells are clustered based on their transcript-compatibility counts. Our method assigns the reads of each cell to equivalence classes via the process of pseudoalignment and simply counts the number of reads that fall in each class to construct a (#eq.classes x #cells) matrix of transcript-compatibility counts. The method then uses the transcript-compatibility counts directly for downstream processing and single-cell clustering. The underlying idea is that even though equivalence classes may not have an explicit biological interpretation, their read counts can collectively provide a distinct signature of each cell’s gene expression; transcript-compatibility counts can be thought of as feature vectors, and cells can be identified by their differential expression over these features. Compared to the conventional approach, our method does not attempt to resolve multi-mapped reads (no need for an assay-specific read-generating model) and requires only transcript-compatibility information for each read (no need for exact read alignment).
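The counting step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation. It assumes pseudoalignment has already produced, for each read, the set of transcripts it is compatible with. The example reads are made up. The text above does not say how the resulting vectors are compared, so the Jensen-Shannon divergence used here is an assumption chosen for illustration.

```python
import math
from collections import Counter

# Hypothetical pseudoalignment output: for each cell, one equivalence
# class (a set of compatible transcripts) per read. Values are made up.
CELLS = {
    "Cell1": [
        frozenset({"t1"}),
        frozenset({"t1", "t2"}),
        frozenset({"t1", "t2", "t3"}),
        frozenset({"t2", "t3"}),
        frozenset({"t3"}),
    ],
    "Cell2": [
        frozenset({"t1"}),
        frozenset({"t1"}),
        frozenset({"t1", "t2"}),
        frozenset({"t1", "t2"}),
    ],
}


def tcc_matrix(cells):
    """Return (class_index, columns), where columns[cell] is that cell's
    column of the (#eq.classes x #cells) matrix of read counts."""
    # Fix a row order over every equivalence class seen in any cell.
    all_classes = {ec for reads in cells.values() for ec in reads}
    class_index = sorted(all_classes, key=lambda ec: tuple(sorted(ec)))
    columns = {}
    for cell, reads in cells.items():
        counts = Counter(reads)  # reads per equivalence class
        columns[cell] = [counts.get(ec, 0) for ec in class_index]
    return class_index, columns


def normalize(counts):
    """Turn a count vector into a probability vector."""
    total = sum(counts)
    return [c / total for c in counts]


def js_divergence(p, q):
    """Jensen-Shannon divergence (base 2) between two distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms with zero probability in x contribute nothing.
        return sum(a * math.log2(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


class_index, columns = tcc_matrix(CELLS)
distance = js_divergence(normalize(columns["Cell1"]),
                         normalize(columns["Cell2"]))
```

No read is split or reassigned here: a multi-mapped read simply increments the count of its equivalence class. The pairwise distances computed this way could then be passed to any standard clustering algorithm.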