A team led by researchers at the Wellcome-MRC Cambridge Stem Cell Institute has developed entropy sorting (ES), a mathematical framework for identifying genes that are indicative of cell identity. ES works in an unsupervised manner: it quantifies whether an observed correlation between features is more likely to have arisen by random chance or from a dependent relationship, without requiring a user-defined significance threshold. On synthetic data, the researchers show that ES removes noisy signals and reveals gene expression patterns at a higher resolution than commonly used feature selection methods. They then apply ES to single-cell RNA sequencing (scRNA-seq) data from human pre-implantation embryos. Previous studies failed to unambiguously identify the early inner cell mass (ICM), suggesting that the human embryo may diverge from the mouse paradigm. In contrast, ES resolves the ICM and reveals sequential lineage bifurcations, as in the classical model. ES thus provides a powerful approach for maximizing information extraction from high-dimensional datasets such as scRNA-seq data.
FFAVES and ESFW workflow
The metrics defined by ES are implemented in two algorithms. The first, FFAVES, uses ES to identify data points in a discrete matrix that are statistically likely to be displaying the wrong state. The second, ESFW, assigns an importance weight to each feature in the data. Higher weights indicate that a feature is more likely to belong to a set of dependent features, while lower weights indicate features that are expressed randomly throughout the data. The yellow, blue, and green boxes show the proposed workflow for applying FFAVES and ESFW to high-dimensional data for unsupervised feature selection. The purple and red boxes outline each algorithm.
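To make the idea of feature weighting concrete, the toy sketch below scores each feature in a binarized matrix by how strongly it depends on the other features. It is not the ES metric or the ESFW algorithm: mutual information is used here as a stand-in dependence measure, and the function names (`mutual_information_bits`, `feature_weights`, `demo`) and the synthetic data are assumptions made purely for illustration. The FFAVES repository holds the authors' actual implementation.

```python
import numpy as np


def mutual_information_bits(x, y):
    """Mutual information (in bits) between two binary vectors.

    Stand-in dependence measure; NOT the entropy sorting metric.
    """
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    mi = 0.0
    for a in (False, True):
        for b in (False, True):
            p_xy = np.mean((x == a) & (y == b))  # joint frequency
            p_x = np.mean(x == a)                # marginal of x
            p_y = np.mean(y == b)                # marginal of y
            if p_xy > 0:  # p_xy > 0 implies both marginals are > 0
                mi += p_xy * np.log2(p_xy / (p_x * p_y))
    return mi


def feature_weights(binary_matrix):
    """Weight each column by its mean pairwise MI with all other columns."""
    m = np.asarray(binary_matrix, dtype=bool)
    n_features = m.shape[1]
    weights = np.zeros(n_features)
    for i in range(n_features):
        total = 0.0
        for j in range(n_features):
            if i != j:
                total += mutual_information_bits(m[:, i], m[:, j])
        weights[i] = total / (n_features - 1)
    return weights


def demo(seed=0):
    """Hypothetical data: 3 features tied to a hidden group, 3 random ones."""
    rng = np.random.default_rng(seed)
    n_cells = 500
    group = rng.random(n_cells) < 0.5  # hidden cell identity
    # Dependent features follow the hidden group, with 10% of states flipped.
    dependent = [group ^ (rng.random(n_cells) < 0.1) for _ in range(3)]
    # Random features are expressed independently of everything else.
    random_feats = [rng.random(n_cells) < 0.5 for _ in range(3)]
    matrix = np.column_stack(dependent + random_feats)
    return feature_weights(matrix)


if __name__ == "__main__":
    print(np.round(demo(), 3))
```

In this toy setting the first three weights should come out clearly larger than the last three, mirroring the description above: features that belong to a dependent set score higher than features expressed at random. Unlike ES as described, this sketch offers no principled way to separate chance correlations from real ones and makes no attempt to correct individual data points as FFAVES does.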
Availability – https://github.com/aradley/FFAVES