In recent years, single-cell RNA sequencing has enabled the discovery of cell types and even subtypes. Its increasing availability provides opportunities to identify cell populations from single-cell RNA-seq data, and computational methods have been employed to reveal gene expression variation among multiple cell populations. Unfortunately, existing methods can suffer from practical limitations such as experimental noise, numerical instability, high dimensionality, and poor computational scalability.
Researchers from the City University of Hong Kong propose an evolutionary multiobjective ensemble pruning algorithm (EMEP) that addresses these limitations. EMEP first applies unsupervised dimensionality reduction to project the data from the original high-dimensional space into low-dimensional subspaces; basic clustering algorithms are then applied in those subspaces to generate different clustering results, which together form cluster ensembles. However, most of those cluster ensembles are unnecessarily bulky, at the expense of extra running time and memory consumption. To overcome that problem, EMEP is designed to dynamically select suitable clustering results from the ensembles. To guide the multiobjective ensemble evolution, three cluster validity indices (the overall cluster deviation, the within-cluster compactness, and the number of basic partition clusters) are formulated as objective functions and optimized with evolutionary multiobjective optimization to improve cell type discovery. The researchers applied EMEP to 55 simulated datasets and seven real single-cell RNA-seq datasets, comprising six single-cell RNA-seq datasets and one large-scale dataset with 3005 cells and 4412 genes. Two case studies were also conducted to provide mechanistic insights into the biological relevance of EMEP. They found that EMEP achieved superior performance over the other clustering algorithms tested, indicating that it can identify cell populations clearly.
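The pruning idea can be illustrated with a minimal Python sketch. It is not the authors' MATLAB implementation: it scores each base partition on three stand-in objectives and keeps only the Pareto non-dominated ones, whereas EMEP evolves selections with an evolutionary multiobjective optimizer. The objective formulas, the function names (`deviation`, `compactness`, `pareto_prune`), and the toy data are all assumptions made for illustration; the paper's exact definitions may differ.

```python
from itertools import combinations
from math import dist


def centroids(points, labels):
    """Return the mean point of every cluster, keyed by cluster label."""
    groups = {}
    for point, label in zip(points, labels):
        groups.setdefault(label, []).append(point)
    return {label: tuple(sum(axis) / len(members) for axis in zip(*members))
            for label, members in groups.items()}


def deviation(points, labels):
    """Assumed objective 1: summed distance of points to their cluster centroid."""
    cents = centroids(points, labels)
    return sum(dist(point, cents[label]) for point, label in zip(points, labels))


def compactness(points, labels):
    """Assumed objective 2: mean pairwise distance between same-cluster points."""
    pairs = [dist(p, q)
             for (p, a), (q, b) in combinations(zip(points, labels), 2)
             if a == b]
    return sum(pairs) / len(pairs) if pairs else 0.0


def objectives(points, labels):
    """All three objectives are minimized; the third is the cluster count."""
    return (deviation(points, labels), compactness(points, labels), len(set(labels)))


def dominates(a, b):
    """True if a is no worse than b everywhere and strictly better somewhere."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))


def pareto_prune(points, ensemble):
    """Return indices of base partitions that no other partition dominates."""
    scores = [objectives(points, labels) for labels in ensemble]
    return [i for i, score in enumerate(scores)
            if not any(dominates(other, score)
                       for j, other in enumerate(scores) if j != i)]


# Toy data: two well-separated groups of three points each (hypothetical).
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
ensemble = [
    [0, 0, 0, 1, 1, 1],  # matches the two groups
    [0, 1, 0, 1, 0, 1],  # mixes the two groups
    [0, 0, 0, 0, 0, 0],  # a single cluster
    [0, 0, 1, 2, 2, 3],  # a finer four-cluster split
]
kept = pareto_prune(points, ensemble)
print(kept)  # [0, 2, 3]: the partition that mixes the groups is pruned
```

The single-cluster and four-cluster partitions survive because each is best on at least one objective, which is why a Pareto front alone does not pick a final answer; some further selection or consensus step is still needed.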
The performance of EMEP and nine other clustering algorithms on 55 simulated datasets. The nine comparison algorithms are Link-based Cluster Ensemble (LCE), Entropy-based Consensus Clustering (ECC), Spectral Clustering (SC), K-means Clustering (KM), clustering by fast search and find of density peaks (CDP), t-Distributed Stochastic Neighbor Embedding (t-SNE), Single-Cell Interpretation via Multikernel Learning (SIMLR), Sparse Spectral Clustering (SSC), and Spectral clustering based on learning similarity matrix (MPSSC). Performance is measured using normalized mutual information (NMI) and the adjusted Rand index (ARI).
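For readers unfamiliar with the two scores, the following self-contained Python sketch shows one common way to compute them from a reference labeling and a predicted labeling. The function names and toy labels are illustrative assumptions. NMI has several normalization variants; this sketch assumes the arithmetic mean of the two entropies, which may not be the variant used in the study.

```python
from collections import Counter
from math import comb, log


def adjusted_rand_index(truth, pred):
    """Chance-corrected pair-counting agreement between two labelings."""
    n = len(truth)
    if n < 2:
        return 1.0  # no pairs to compare
    same_both = sum(comb(c, 2) for c in Counter(zip(truth, pred)).values())
    same_truth = sum(comb(c, 2) for c in Counter(truth).values())
    same_pred = sum(comb(c, 2) for c in Counter(pred).values())
    expected = same_truth * same_pred / comb(n, 2)
    maximum = (same_truth + same_pred) / 2
    if maximum == expected:
        return 1.0  # both labelings are trivial and identical
    return (same_both - expected) / (maximum - expected)


def normalized_mutual_information(truth, pred):
    """Mutual information divided by the mean entropy of the two labelings."""
    n = len(truth)
    t_counts, p_counts = Counter(truth), Counter(pred)
    mutual_info = sum(
        c / n * log(n * c / (t_counts[t] * p_counts[p]))
        for (t, p), c in Counter(zip(truth, pred)).items()
    )

    def entropy(counts):
        return -sum(c / n * log(c / n) for c in counts.values())

    mean_entropy = (entropy(t_counts) + entropy(p_counts)) / 2
    return mutual_info / mean_entropy if mean_entropy > 0 else 1.0


truth = [0, 0, 0, 1, 1, 1]
relabeled = [1, 1, 1, 0, 0, 0]   # same grouping, different label names
imperfect = [0, 0, 1, 1, 1, 1]   # one cell assigned to the wrong group

print(adjusted_rand_index(truth, relabeled))            # 1.0
print(adjusted_rand_index(truth, imperfect))            # about 0.324
print(normalized_mutual_information(truth, imperfect))  # strictly between 0 and 1
```

Both scores ignore the label names themselves, so a clustering that recovers the right groups under different names still scores 1.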
Availability – EMEP is written in MATLAB and is available at https://github.com/lixt314/EMEP.