Resolving single-cell heterogeneity from hundreds of thousands of cells through sequential hybrid clustering and NMF

The rapid proliferation of single-cell RNA-Sequencing (scRNA-Seq) technologies has spurred the development of diverse computational approaches to detect transcriptionally coherent populations. While the algorithms for detecting heterogeneity have grown more complex, most require significant user tuning, rely heavily on dimension reduction techniques, and do not scale to ultra-large datasets. University of Cincinnati researchers previously described a multi-step algorithm, Iterative Clustering and Guide-gene Selection (ICGS), which applies intra-gene correlation and hybrid clustering to uniquely resolve novel transcriptionally coherent cell populations from an intuitive graphical user interface.

Here, they describe a new iteration of ICGS (ICGS2) that outperforms state-of-the-art scRNA-Seq detection workflows when applied to well-established benchmarks. This approach combines multiple complementary subtype detection methods (HOPACH, sparse-NMF, cluster “fitness”, SVM) to resolve rare and common cell states, while minimizing differences due to donor or batch effects. Using data from multiple cell atlases, the researchers show that the PageRank algorithm effectively down-samples ultra-large scRNA-Seq datasets without losing extremely rare or transcriptionally similar yet distinct cell types, while still recovering novel transcriptionally distinct cell populations. They believe this new approach holds tremendous promise for reproducibly resolving hidden cell populations in complex datasets.
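To make the down-sampling idea above concrete, here is a minimal sketch in plain Python. It is not the ICGS2 implementation: the k-nearest-neighbor graph construction, the Euclidean distance, the damping factor, and the function names `knn_graph`, `pagerank` and `downsample` are all assumptions chosen for illustration. The sketch ranks cells by PageRank on a neighbor graph and keeps the highest-ranked ones.

```python
import math


def knn_graph(points, k):
    """Directed kNN graph: each cell index maps to its k nearest cells (Euclidean)."""
    graph = {}
    for i, p in enumerate(points):
        dists = sorted((math.dist(p, q), j) for j, q in enumerate(points) if j != i)
        graph[i] = [j for _, j in dists[:k]]
    return graph


def pagerank(graph, damping=0.85, iterations=100, tol=1e-10):
    """Power-iteration PageRank on a dict of node -> list of out-neighbors."""
    n = len(graph)
    rank = {node: 1.0 / n for node in graph}
    for _ in range(iterations):
        # Every node receives the teleportation share first.
        new_rank = {node: (1.0 - damping) / n for node in graph}
        for node, neighbors in graph.items():
            if neighbors:
                share = damping * rank[node] / len(neighbors)
                for nb in neighbors:
                    new_rank[nb] += share
            else:
                # Dangling node: spread its rank evenly over all nodes.
                share = damping * rank[node] / n
                for other in graph:
                    new_rank[other] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in graph)
        rank = new_rank
        if delta < tol:
            break
    return rank


def downsample(points, k, n_keep):
    """Return the sorted indices of the n_keep cells with the highest PageRank."""
    rank = pagerank(knn_graph(points, k))
    ranked = sorted(rank, key=rank.get, reverse=True)
    return sorted(ranked[:n_keep])


# Toy usage: eight "cells" described by two hypothetical expression features.
cells = [(0.0, 0.1), (0.1, 0.0), (0.2, 0.1), (0.1, 0.2),
         (5.0, 5.1), (5.1, 5.0), (5.2, 5.2), (9.0, 0.5)]
kept = downsample(cells, k=3, n_keep=4)
```

In this toy version the brute-force neighbor search is quadratic in the number of cells, so a real dataset of hundreds of thousands of cells would need an approximate neighbor index instead.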

Performance of ICGS2 against diverse alternative unsupervised scRNA-Seq algorithms


A) Overview of the ICGS2 workflow for single-cell RNA-Seq population prediction. These steps include: 1) PageRank down-sampling (optional), 2) feature selection (ICGS), 3) dimension reduction (sparse-NMF), 4) cluster refinement/exclusion (“fitness”) and 5) cluster assignments (linear SVM). B) Comparison of ICGS2 to previously evaluated algorithms on benchmarking datasets of varying size and complexity, assessing the ability to detect previously defined cell populations. Performance of each method was evaluated by comparing the author-annotated cell-to-cluster assignments to those obtained by each algorithm using the Adjusted Rand Index (ARI) (Methods). C) Comparison of ICGS2 to the top-performing methods from panel B for Tabula Muris tissue scRNA-Seq (SMARTSeq2), using an Aggregated ARI to account for contributing composite sub-clusters (see Fig. S1B for corresponding ARI values and Table S1 for cluster numbers).
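For readers unfamiliar with the metric in panel B, the sketch below computes the standard Adjusted Rand Index between two label assignments using only the Python standard library. The function name `adjusted_rand_index` and the toy labels are illustrative assumptions. This is the plain ARI only; the Aggregated ARI used in panel C is not reproduced here.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Standard ARI: 1.0 for identical partitions, near 0 for random agreement."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one cell: the partitions trivially agree

    # Pairs of cells co-clustered in both partitions, and in each one separately.
    contingency = Counter(zip(labels_true, labels_pred))
    sum_cells = sum(comb(c, 2) for c in contingency.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are a single cluster
    return (sum_cells - expected) / (max_index - expected)


# Toy usage: author annotations versus a hypothetical algorithm's clusters.
annotated = ["T", "T", "B", "B", "NK", "NK"]
predicted = [1, 1, 2, 2, 2, 3]
score = adjusted_rand_index(annotated, predicted)
```

Because the index depends only on which cells are grouped together, cluster names do not need to match between the annotation and the prediction.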

Availability – ICGS2 is implemented in Python. The source code and documentation are available at:

Venkatasubramanian M, Chetal K, Schnell D, Atluri G, Salomonis N. (2020) Resolving single-cell heterogeneity from hundreds of thousands of cells through sequential hybrid clustering and NMF. Bioinformatics [Epub ahead of print]. [abstract]
