Using single-cell RNA-seq (scRNA-seq), the full transcriptome of individual cells can be acquired, enabling a quantitative cell-type characterisation based on expression profiles. However, due to the large variability in gene expression, identifying cell types based on the transcriptome remains challenging. Researchers from the Wellcome Trust Sanger Institute present Single-Cell Consensus Clustering (SC3), a tool for unsupervised clustering of scRNA-seq data. SC3 achieves high accuracy and robustness by consistently integrating different clustering solutions through a consensus approach. Tests on twelve published datasets show that SC3 outperforms five existing methods while remaining scalable, as shown by the analysis of a large dataset containing 44,808 cells. Moreover, an interactive graphical implementation makes SC3 accessible to a wide audience of users, and SC3 aids biological interpretation by identifying marker genes, differentially expressed genes and outlier cells. The researchers illustrate the capabilities of SC3 by characterising newly obtained transcriptomes from subclones of neoplastic cells collected from patients.
The SC3 framework for consensus clustering
(a) Overview of clustering with the SC3 framework (see Methods). A total of 6D clusterings are obtained, where D is the total number of dimensions d1, …, dD considered. These clusterings are then combined through a consensus step to increase accuracy and robustness. Here, the consensus step is exemplified using the Treutlein data: the binary matrices (Methods) corresponding to each clustering are averaged, and the resulting matrix is segmented using hierarchical clustering up to the k-th hierarchical level (k = 5 in this example). (b) Published datasets used to set SC3 parameters. N is the number of cells in a dataset; k is the number of clusters originally identified by the authors. (c) Testing the distances, nonlinear transformations and d range. Median of ARI over 100 realizations of the SC3 clustering for six gold standard datasets (Biase, Yan, Goolam, Kolodziejczyk, Deng and Pollen, colours as in (b)). The x-axis shows the number of eigenvectors d (see (a)) as a percentage of the total number of cells, N. The black vertical lines indicate the interval d = 4-7% of the total number of cells N, showing high accuracy in the classification. (d) Histogram of the d values where ARI > 0.95 is achieved for the gold standard datasets. The black vertical lines indicate the same interval as in (c). (e) 100 realizations of the SC3 clustering of the datasets shown in (b). "Individual" corresponds to clustering without the consensus step. "Consensus" corresponds to the consensus clustering over the parameter set (Methods). The black line corresponds to ARI = 0.8. Dots represent individual clustering runs. The dashed black line separates gold and silver standard datasets.
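The consensus step described in (a) can be illustrated with a minimal Python sketch. This is not the SC3 implementation: the function names `consensus_matrix` and `consensus_clusters`, the toy labelings, the use of 1 minus the consensus value as a distance, and the choice of complete linkage are all assumptions made for illustration. Each clustering is turned into a binary matrix (1 where two cells share a cluster), the binary matrices are averaged, and the averaged matrix is cut into k groups by hierarchical clustering.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform


def consensus_matrix(labelings):
    """Average the binary co-membership matrices of several clusterings."""
    labelings = np.asarray(labelings)  # shape: (n_clusterings, n_cells)
    # binary[c, i, j] is True when cells i and j share a cluster in clustering c
    binary = labelings[:, :, None] == labelings[:, None, :]
    return binary.mean(axis=0)


def consensus_clusters(labelings, k):
    """Cut a hierarchical clustering of the consensus matrix into k groups."""
    consensus = consensus_matrix(labelings)
    # Assumption: treat (1 - consensus) as a dissimilarity between cells
    distance = 1.0 - consensus
    np.fill_diagonal(distance, 0.0)
    # Assumption: complete linkage on the condensed distance vector
    tree = linkage(squareform(distance, checks=False), method="complete")
    return fcluster(tree, t=k, criterion="maxclust")


# Toy input (hypothetical): three clusterings of six cells.
# Cluster ids need not match across clusterings; only co-membership matters.
labelings = [
    [0, 0, 0, 1, 1, 1],
    [1, 1, 1, 0, 0, 0],
    [0, 0, 1, 1, 1, 1],
]
consensus = consensus_matrix(labelings)
labels = consensus_clusters(labelings, k=2)
```

In this toy input the third clustering disagrees with the other two about the third cell, so its consensus value with the first two cells is 2/3 rather than 1. The hierarchical cut still groups it with them, which is the sense in which averaging over many clusterings adds robustness.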