The standard analysis pipeline for single-cell RNA-seq data consists of sequential steps that begin with clustering the cells. An inherent limitation of this pipeline is that an imperfect clustering result can irreversibly affect the subsequent steps. For example, some cell types are not well distinguished by clustering because they largely share the same global structure, such as the anterior primitive streak and mid primitive streak cells. If one searches for differentially expressed genes (DEGs) based solely on clustering, marker genes that distinguish these types will be missed. Moreover, clustering depends on many parameters and is often subject to manual decisions.
To overcome these limitations, researchers at Seoul National University have developed MarcoPolo, a method that identifies informative DEGs independently of prior clustering. MarcoPolo ranks genes by evaluating whether their expression distributions are bimodal, whether similar expression patterns are observed in other genes, and whether the expressing cells are proximal in a low-dimensional space. Using real datasets with FACS-purified cell labels, the researchers demonstrate that MarcoPolo recovers marker genes better than competing methods. Notably, MarcoPolo finds key genes that separate cell types that standard clustering cannot distinguish. MarcoPolo is distributed as a convenient software package that provides analysis results in an HTML file.
Overview of MarcoPolo
(A) The standard analysis pipeline for scRNA-seq data, consisting of two consecutive steps. In this pipeline, any errors in the clustering step irreversibly affect the succeeding step. (B) The MarcoPolo analysis pipeline. MarcoPolo identifies informative DEGs independently of clustering, and these genes can be used for various purposes to complement the standard pipeline. (C) MarcoPolo fits a single Poisson distribution and a two-component Poisson mixture model separately to each gene. In the tSNE plots, the cells are colored by the groups they belong to. In the bottom plots, on-cells (high expression component) are colored red, and off-cells (low expression component) are colored orange. (D) MarcoPolo’s voting system prioritizes genes that share an expression pattern with other genes. The on/off patterns of 6 different genes are plotted. Arrows indicate that a gene supports (votes for) another gene because they share expression patterns. Gene 1 and gene 3 received three votes each, while gene 2 received zero votes. (E) MarcoPolo’s proximity score system calculates the variance of the principal component (PC) values of on-cells for each gene. Higher ranks are assigned to genes with low variance. (F) MarcoPolo’s bimodality score system gives higher scores to genes whose expression follows a bimodal distribution. (G) MarcoPolo offers the analysis result in a local database (HTML file). For each gene, the HTML report provides the log fold change values, expression on/off plots, histograms, scores, and the biological description of the gene.
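The ideas in panels C to E can be illustrated with a minimal, standard-library Python sketch. It is not MarcoPolo's actual code or API: the function names, the EM initialization, the 0.5 responsibility cutoff for calling on-cells, the Jaccard threshold used for voting, and the choice to sum variances across PC dimensions are all assumptions made for illustration. The voting here is also symmetric, whereas the arrows in panel D are directed.

```python
import math

def poisson_logpmf(k, lam):
    """Log-probability of count k under Poisson(lam)."""
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def single_poisson_loglik(counts):
    """Log-likelihood of the one-component fit (MLE rate = sample mean)."""
    lam = max(sum(counts) / len(counts), 1e-8)
    return sum(poisson_logpmf(k, lam) for k in counts)

def fit_two_poisson_mixture(counts, n_iter=200):
    """EM for a two-component Poisson mixture on one gene's counts.

    Returns (lam_off, lam_on, loglik, on_mask). loglik and on_mask come
    from the final E-step; the rates come from the final M-step.
    """
    n = len(counts)
    mean = sum(counts) / n
    # Assumed initialization: one rate below the mean, one above it.
    lam_off, lam_on, w_on = max(0.5 * mean, 1e-3), max(2.0 * mean, 1.0), 0.5
    for _ in range(n_iter):
        # E-step: responsibility of the "on" component for each cell.
        resp, loglik = [], 0.0
        for k in counts:
            a = math.log(1.0 - w_on) + poisson_logpmf(k, lam_off)
            b = math.log(w_on) + poisson_logpmf(k, lam_on)
            m = max(a, b)
            log_total = m + math.log(math.exp(a - m) + math.exp(b - m))
            resp.append(math.exp(b - log_total))
            loglik += log_total
        # M-step: update the mixing weight and both rates.
        n_on = sum(resp)
        w_on = min(max(n_on / n, 1e-6), 1.0 - 1e-6)
        lam_on = max(sum(r * k for r, k in zip(resp, counts)) / max(n_on, 1e-12), 1e-8)
        lam_off = max(sum((1 - r) * k for r, k in zip(resp, counts)) / max(n - n_on, 1e-12), 1e-8)
    on_mask = [r > 0.5 for r in resp]  # assumed cutoff for on-cells
    if lam_on < lam_off:  # keep "on" as the high-expression component
        lam_off, lam_on = lam_on, lam_off
        on_mask = [not flag for flag in on_mask]
    return lam_off, lam_on, loglik, on_mask

def vote_counts(on_masks, threshold=0.7):
    """Per gene, count other genes whose on-cell sets have Jaccard >= threshold."""
    sets = [{i for i, flag in enumerate(mask) if flag} for mask in on_masks]
    votes = [0] * len(sets)
    for i, a in enumerate(sets):
        for j, b in enumerate(sets):
            if i != j and (a | b) and len(a & b) / len(a | b) >= threshold:
                votes[i] += 1
    return votes

def proximity_score(pcs, on_mask):
    """Variance of on-cells' PC values, summed over PCs (lower = more compact)."""
    on = [row for row, flag in zip(pcs, on_mask) if flag]
    if len(on) < 2:
        return float("inf")
    total = 0.0
    for d in range(len(on[0])):
        vals = [row[d] for row in on]
        mu = sum(vals) / len(vals)
        total += sum((v - mu) ** 2 for v in vals) / len(vals)
    return total
```

In this sketch, a gene whose mixture log-likelihood clearly exceeds its single-Poisson log-likelihood would be a candidate for a bimodal on/off pattern; how MarcoPolo itself turns the two fits into a bimodality score is not specified in the text above, so that comparison is only one plausible reading.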
Availability – MarcoPolo is available at the GitHub repository (https://github.com/chanwkimlab/MarcoPolo).