The ability to discover new cell populations by unsupervised clustering of single-cell transcriptomics data has revolutionized biology. However, all unsupervised methods have adjustable parameters, which makes it difficult for researchers to decide on the right resolution for clustering. Researchers often have to rely on prior expectations about the number of cell types, which can bias the clustering outcome. If the data is over-clustered, the resulting clusters are driven purely by random noise; if the data is under-clustered, interesting phenotypes can be overlooked.
To address this problem, we have developed SIGnal-Measurement-Angle (SIGMA), a clusterability measure for scRNA-seq data. It leverages concepts from random matrix theory and low-rank perturbation theory to determine, purely from first principles, whether sub-clustering is possible. We take advantage of the noise in scRNA-seq data and treat the signal as a perturbation to that noise in order to pinpoint the transition to meaningful clusterability. The measure ranges from 0 to 1. A value of 0 indicates that the variance within a cluster is due only to random fluctuations, so the cluster can be interpreted as a pure population. High clusterability (SIGMA close to 1) means that the variance within the cluster differs markedly from random noise, so the cluster merits sub-clustering.
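The idea of treating signal as a low-rank perturbation of noise can be illustrated with a toy example. The sketch below is not the SIGMA implementation; it assumes i.i.d. Gaussian noise with unit variance and a planted two-group signal, and the helper names `mp_upper_edge` and `top_singular` are hypothetical. It compares the largest singular value with the asymptotic edge of the noise spectrum, and measures the squared cosine of the angle between the leading singular vector and the planted group structure, a quantity that lies between 0 and 1.

```python
import numpy as np

def mp_upper_edge(n_rows, n_cols, sigma=1.0):
    """Asymptotic largest singular value of an n_rows x n_cols matrix
    whose entries are i.i.d. noise with standard deviation sigma."""
    return sigma * (np.sqrt(n_rows) + np.sqrt(n_cols))

def top_singular(matrix):
    """Return the largest singular value and its right singular vector."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return s[0], vt[0]

rng = np.random.default_rng(0)
n_genes, n_cells = 500, 200  # toy dimensions (assumption)

# Pure noise: a "cluster" with no sub-structure.
noise = rng.standard_normal((n_genes, n_cells))

# Planted rank-one signal: two equally sized groups of cells.
labels = np.repeat([1.0, -1.0], n_cells // 2)
cell_direction = labels / np.linalg.norm(labels)
gene_direction = rng.standard_normal(n_genes)
gene_direction /= np.linalg.norm(gene_direction)
strength = 60.0  # illustrative value, chosen well above the detection threshold
signal = strength * np.outer(gene_direction, cell_direction)

edge = mp_upper_edge(n_genes, n_cells)
s_noise, v_noise = top_singular(noise)
s_mixed, v_mixed = top_singular(noise + signal)

# Squared cosine of the angle between the leading singular vector
# and the planted cell grouping (0 = unrelated, 1 = perfectly aligned).
cos2_noise = float(np.dot(v_noise, cell_direction) ** 2)
cos2_mixed = float(np.dot(v_mixed, cell_direction) ** 2)

print(f"noise edge: {edge:.2f}")
print(f"pure noise:   top singular value {s_noise:.2f}, cos^2 {cos2_noise:.3f}")
print(f"noise+signal: top singular value {s_mixed:.2f}, cos^2 {cos2_mixed:.3f}")
```

In this toy setting, the leading singular value of the pure-noise matrix stays near the noise edge and its singular vector carries almost no information about the grouping, whereas the planted signal produces a singular value well above the edge and a singular vector closely aligned with the two groups.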
We have tested the method on simulated and real data sets and discovered previously overlooked clusters. For example, in a fetal kidney data set that we previously published, we discovered a cluster containing only 68 cells. We also found subpopulations of interstitial cells that differ in their spatial distribution in the tissue.
We expect this method to help the scRNA-seq community make unsupervised clustering more robust, and to bring renewed awareness of random noise as a factor that sets hard limits on clustering.