Single-cell RNA-sequencing (scRNA-seq) technology is a powerful tool to study organism from a single cell perspective and explore the heterogeneity between cells. Clustering is a fundamental step in scRNA-seq data analysis and it is the key to understand cell function and constitutes the basis of other advanced analysis. Nonnegative Matrix Factorization (NMF) has been widely used in clustering analysis of transcriptome data and achieved good performance. However, the existing NMF model is unsupervised and ignores known gene functions in the process of clustering. Knowledges of cell markers genes (genes that only express in specific cells) in human and model organisms have been accumulated a lot, such as the Molecular Signatures Database (MSigDB), which can be used as prior information in the clustering analysis of scRNA-seq data. Because the same kind of cells is likely to have similar biological functions and specific gene expression patterns, the marker genes of cells can be utilized as prior knowledge in the clustering analysis.
Researchers from the People’s Hospital of Longhua District propose a robust and semi-supervised NMF (rssNMF) model, which introduces a new variable to absorb noises of data and incorporates marker genes as prior information into a graph regularization term. They use rssNMF to solve the clustering problem of scRNA-seq data.
An illustration of how to construct the weight matrix Q (C)
(A) The heatmap is an ideal simulated gene expression matrix X; (B) the heatmap is part of the matrix X that only selects rows of marker genes.
Twelve scRNA-seq datasets with true labels are used to test the model performance and the results illustrate that this model outperforms original NMF and other common methods such as KMeans and Hierarchical Clustering. Biological significance analysis shows that rssNMF can identify key subclasses and latent biological processes.
Availability – R code of NMF and NMF related variants are available as a Supplemental File.