Single-cell RNA sequencing (scRNA-seq) has revolutionized genomics research by enabling measurement of transcriptomic profiles at the level of single cells. One of the most fundamental problems in scRNA-seq data analysis is cell clustering, for which a large number of methods have been developed. As scRNA-seq is increasingly applied in larger-scale studies, researchers face the problem of clustering cells when the data come from more than one subject. One challenge in analyzing such data is subject-specific systematic variation: heterogeneity across subjects can have a significant impact on clustering accuracy. However, existing methods that address this effect suffer from several limitations.
In this work, Emory University researchers develop a novel statistical method named ‘EDClust’ for scRNA-seq cell clustering when data are from multiple subjects. EDClust models the sequence read counts with a mixture of Dirichlet-multinomial distributions and explicitly accounts for cell type heterogeneity, subject heterogeneity, and clustering uncertainty. An EM-MM hybrid algorithm is derived to maximize the data likelihood and cluster the cells. The researchers evaluate the proposed method in a series of simulation studies, which demonstrate the outstanding performance of EDClust. Comprehensive benchmarking on four real scRNA-seq datasets spanning various tissue types and species demonstrates a substantial accuracy improvement of EDClust over existing methods.
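To make the modeling idea concrete, the following is a minimal sketch of a plain Dirichlet-multinomial mixture and the E-step that assigns a cell to clusters with uncertainty. It is not the EDClust implementation: the subject-specific effects and the MM updates described above are omitted, and the function names, the toy concentration parameters, and the example cell are all hypothetical and chosen only for illustration.

```python
import math


def dm_log_pmf(counts, alpha):
    """Log-probability of a read-count vector under a Dirichlet-multinomial.

    counts: non-negative integer counts per gene for one cell.
    alpha:  positive concentration parameters, one per gene.
    """
    n = sum(counts)
    a = sum(alpha)
    # Multinomial coefficient: log n! - sum_g log x_g!
    log_coef = math.lgamma(n + 1) - sum(math.lgamma(x + 1) for x in counts)
    # Ratio of Dirichlet normalizers after integrating out gene proportions
    log_ratio = math.lgamma(a) - math.lgamma(n + a)
    log_ratio += sum(
        math.lgamma(x + al) - math.lgamma(al) for x, al in zip(counts, alpha)
    )
    return log_coef + log_ratio


def responsibilities(counts, weights, alphas):
    """E-step for one cell: posterior probability of each cluster."""
    log_post = [
        math.log(w) + dm_log_pmf(counts, al) for w, al in zip(weights, alphas)
    ]
    # Subtract the maximum before exponentiating for numerical stability
    m = max(log_post)
    unnorm = [math.exp(lp - m) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical toy setup: two clusters, three genes (assumed values)
weights = [0.5, 0.5]
alphas = [[8.0, 1.0, 1.0], [1.0, 1.0, 8.0]]
cell = [12, 2, 1]  # read counts for a single cell
post = responsibilities(cell, weights, alphas)
```

Here `post` holds soft cluster memberships rather than a hard label, which is one simple way a mixture model can represent clustering uncertainty. In a full EM procedure these responsibilities would then be used to update the mixture weights and concentration parameters.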