The transcriptome of single cells can reveal important information about cellular states and heterogeneity within populations of cells. Recently, single-cell RNA-sequencing has facilitated expression profiling of large numbers of single cells in parallel. To fully exploit these data, it is critical that suitable computational approaches are developed. One key challenge, especially pertinent when considering dividing populations of cells, is to understand the cell-cycle stage of each captured cell.
Here, researchers from the Wellcome Trust Sanger Institute describe and compare five established supervised machine learning methods and a custom-built predictor for allocating cells to their cell-cycle stage on the basis of their transcriptome. In particular, they assess the impact of different normalization strategies and the usage of prior knowledge on the predictive power of the classifiers.
Overview of the approach. The approach takes the transcriptional profile of individual cells as input and extracts information on cell-cycle markers (left). The expression profiles of these genes are then extracted from a training dataset and used to train a prediction algorithm (top), which can then be used to predict the cell-cycle stage of individual cells in independent datasets.
The researchers tested the methods on previously published datasets and found that a PCA-based approach and the custom predictor performed best. Moreover, their analysis shows that performance depends strongly on normalization and on the use of prior knowledge. Only by leveraging prior knowledge in the form of cell-cycle annotated genes and by preprocessing the data using a rank-based normalization is it possible to robustly capture the transcriptional cell-cycle signature across different cell types, organisms and experimental protocols.
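To make the two preprocessing ideas concrete, here is a minimal Python sketch, assuming a simple within-cell ranking of marker-gene expression. It is not the authors' implementation: the marker list, the function names `select_genes` and `rank_normalize`, the handling of missing genes and the example values are all illustrative assumptions.

```python
# Illustrative sketch only: names, marker list and values are assumptions,
# not taken from the authors' code.

# Hypothetical stand-in for a curated list of cell-cycle annotated genes.
CELL_CYCLE_GENES = ["Ccnb1", "Cdk1", "Mki67"]


def select_genes(profile, genes):
    """Keep only the given genes, in a fixed order.

    Genes missing from the profile are treated as unexpressed (0.0),
    which is an assumption made for this sketch.
    """
    return [profile.get(g, 0.0) for g in genes]


def rank_normalize(values):
    """Replace values by their within-cell average ranks, scaled to (0, 1]."""
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    ranks = [0.0] * n
    start = 0
    while start < n:
        end = start
        # Extend the block while the next value is tied with the current one.
        while end + 1 < n and values[order[end + 1]] == values[order[start]]:
            end += 1
        avg_rank = (start + end) / 2 + 1  # 1-based average rank of the tied block
        for k in range(start, end + 1):
            ranks[order[k]] = avg_rank
        start = end + 1
    return [r / n for r in ranks]


# Two made-up cells whose marker genes are ordered the same way but measured
# on very different scales (for example, different protocols).
cell_a = {"Ccnb1": 5.0, "Cdk1": 0.0, "Mki67": 10.0, "Actb": 300.0}
cell_b = {"Ccnb1": 50.0, "Cdk1": 0.0, "Mki67": 900.0, "Actb": 8000.0}

features_a = rank_normalize(select_genes(cell_a, CELL_CYCLE_GENES))
features_b = rank_normalize(select_genes(cell_b, CELL_CYCLE_GENES))
print(features_a)
print(features_b)
```

Because ranks depend only on the ordering of values within a cell, both cells map to the same feature vector despite the difference in scale. This is one plausible reason a rank-based normalization could transfer across protocols, and features of this kind could then be passed to any supervised classifier.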
Validation on data with known cell-cycle phase. A-C, F1 scores from internal cross-validation for different gene sets; the F1 score for G1 phase is shown in green, for S phase in orange and for G2M phase in blue. Red lines represent the macro-averaged F1 score. A, all variable genes; B, all annotated cell-cycle genes; C, all variable cell-cycle genes. D-F, F1 scores on an independent test set. D, all variable genes; E, all annotated cell-cycle genes; F, all variable cell-cycle genes.
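For readers unfamiliar with the metric, the sketch below shows how per-phase and macro-averaged F1 scores can be computed from true and predicted labels. It uses the standard definition of F1 (the harmonic mean of precision and recall); the function names and the toy labels are assumptions made for illustration, not data from the study.

```python
# Illustrative sketch: function names and labels below are made up.

def f1_per_class(y_true, y_pred, labels):
    """Return the F1 score of each class, treating it as one-vs-rest."""
    scores = {}
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never occurs.
        scores[label] = 2 * tp / denom if denom else 0.0
    return scores


def macro_f1(y_true, y_pred, labels):
    """Unweighted mean of the per-class F1 scores."""
    scores = f1_per_class(y_true, y_pred, labels)
    return sum(scores.values()) / len(scores)


phases = ["G1", "S", "G2M"]
y_true = ["G1", "G1", "S", "S", "G2M", "G2M"]
y_pred = ["G1", "S", "S", "S", "G2M", "G1"]

per_phase = f1_per_class(y_true, y_pred, phases)
print(per_phase)
print(macro_f1(y_true, y_pred, phases))
```

The macro average weights each phase equally regardless of how many cells it contains, so a classifier cannot score well simply by favouring the most common phase.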
Availability – The code for the implementation of the six cell-cycle predictors will be made available on GitHub.