Single-cell RNA sequencing (scRNA-seq), which promises to enable the quantitative study of biological processes at the single-cell level, is now a routine part of experimental practice. One key computational challenge in analyzing scRNA-seq data is cell type annotation. A source of concern, however, is that the data analysis protocols for clustering cells suffer from low reproducibility and poorly quantified accuracy.
Northwestern University researchers have developed a new benchmark for determining clustering accuracy that uses a dataset in which independent reference annotations are generated from surface protein measurements. The researchers then systematically investigate how labelling accuracy is affected by different approaches to feature selection, different clustering algorithms, and different sets of parameter values. They demonstrate quantitatively the impact of feature selection and the poor performance of a widely used approach. Additionally, they show that an approach grounded in information theory can provide a generalizable, reliable, and accurate process for discarding uninformative features.
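The summary above does not say which information-theoretic criterion the researchers use. As a purely illustrative sketch, and not the authors' method, the following assumes a simple stand-in: score each gene by the normalized Shannon entropy of its counts across cells, and discard genes whose counts are spread almost uniformly over all cells, since such genes carry little information for telling cells apart. The function names, the gene names, and the 0.9 cutoff are all hypothetical.

```python
from math import log2


def expression_entropy(counts):
    """Normalized Shannon entropy of one gene's counts across cells.

    Returns a value between 0 (all counts in one cell) and 1 (counts spread
    evenly over all cells), or None when the entropy is undefined.
    """
    total = sum(counts)
    n_cells = len(counts)
    if total == 0 or n_cells < 2:
        return None  # gene never detected, or too few cells to compare
    probs = [c / total for c in counts if c > 0]
    entropy = -sum(p * log2(p) for p in probs)
    return entropy / log2(n_cells)


def select_informative_genes(matrix, max_entropy=0.9):
    """Keep genes whose normalized entropy is below a (hypothetical) cutoff.

    matrix maps a gene name to its list of counts, one entry per cell.
    """
    kept = []
    for gene, counts in matrix.items():
        h = expression_entropy(counts)
        if h is not None and h < max_entropy:
            kept.append(gene)
    return kept


# Toy counts for four hypothetical genes measured in four cells.
toy_counts = {
    "GENE_A": [5, 5, 5, 5],    # uniform across cells: uninformative
    "GENE_B": [20, 0, 0, 0],   # concentrated in one cell
    "GENE_C": [0, 0, 0, 0],    # never detected
    "GENE_D": [10, 10, 0, 0],  # restricted to half of the cells
}
selected = select_informative_genes(toy_counts)
```

In practice the cutoff would need to be chosen against reference annotations rather than fixed in advance, which is the kind of question an externally validated benchmark is meant to answer.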
Limitations of the current framework for optimizing scRNA-seq cell type classification algorithms, and development of an externally validated dataset
A. Current implicit frameworks for optimizing scRNA-seq classification algorithms assume that some algorithm, typically the Louvain method from Seurat, yields a ground-truth classification against which the accuracy of other algorithms is then determined. B. A reproducible, objective framework would instead use independently generated, robust, and reproducible reference annotations against which the accuracy of scRNA-seq classification algorithms can be determined. We believe that, to avoid circularity, reference annotations should be based on independent approaches such as surface protein expression or immunostaining. C. Creation of externally validated labelling for a 10x Genomics dataset of peripheral blood mononuclear cells (PBMCs) from a healthy donor.
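The comparison described in panel B, scoring a clustering against independently generated reference labels, can be sketched as follows. This is a minimal illustration that assumes the adjusted Rand index as the agreement measure; the text does not state which metric the researchers use, and the cell labels below are invented toy data. Because the score depends only on which cells are grouped together, arbitrary cluster IDs can be compared directly with named reference cell types.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(reference, predicted):
    """Adjusted Rand index between two labelings of the same cells.

    1.0 means the two partitions agree perfectly; values near 0 mean the
    agreement is no better than expected by chance.
    """
    if len(reference) != len(predicted):
        raise ValueError("label lists must have the same length")
    n = len(reference)
    if n < 2:
        raise ValueError("need at least two cells")

    # Count cells per (reference, predicted) pair and per label.
    pair_counts = Counter(zip(reference, predicted))
    ref_counts = Counter(reference)
    pred_counts = Counter(predicted)

    # Number of cell pairs grouped together under each view.
    together_both = sum(comb(c, 2) for c in pair_counts.values())
    together_ref = sum(comb(c, 2) for c in ref_counts.values())
    together_pred = sum(comb(c, 2) for c in pred_counts.values())

    expected = together_ref * together_pred / comb(n, 2)
    maximum = (together_ref + together_pred) / 2
    if maximum == expected:
        return 1.0  # degenerate case: both labelings are trivial
    return (together_both - expected) / (maximum - expected)


# Hypothetical reference labels (e.g. derived from surface protein
# measurements) and two hypothetical clustering results for six cells.
reference = ["T", "T", "T", "B", "B", "B"]
good_clusters = [0, 0, 0, 1, 1, 1]
poor_clusters = [0, 0, 1, 1, 2, 2]

score_good = adjusted_rand_index(reference, good_clusters)
score_poor = adjusted_rand_index(reference, poor_clusters)
```

The key point is that the reference labels come from a measurement independent of the clustering algorithm being scored, so no algorithm is evaluated against its own output.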
Availability – The source code is publicly available at https://github.com/amarallab/Benchmark_scRNA_seq