The research landscape of single-cell and single-nuclei RNA-sequencing is evolving rapidly. In particular, the technology has greatly facilitated the detection of rare cells. However, automated, unbiased, and accurate annotation of rare subpopulations remains challenging. Once rare cells are identified in one dataset, it is usually necessary to generate further specific datasets to enrich the analysis (e.g., with samples from other tissues). From a machine learning perspective, the challenge arises because rare-cell subpopulations constitute an imbalanced classification problem. Researchers at the University of Rostock introduce a machine learning (ML)-based oversampling method that takes the gene expression counts of already identified rare cells as input, generates synthetic cells from them, and then identifies similar rare cells in other publicly available experiments. Their method, single-cell synthetic oversampling (sc-SynO), is based on the Localized Random Affine Shadowsampling (LoRAS) algorithm, which corrects for the overall imbalance ratio between the minority and majority classes.
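The oversampling idea can be illustrated with a minimal sketch. It is a simplified, LoRAS-style procedure and not the authors' implementation: for each rare cell it takes the nearest rare-cell neighbours, perturbs them with small Gaussian noise to obtain "shadow samples", and combines a few shadow samples with random weights that sum to one. The function name, all parameter names and defaults, and the toy expression values are assumptions made for illustration. The weights are normalised uniform draws chosen here for simplicity.

```python
import math
import random


def loras_like_oversample(minority, n_new, k=3, n_shadow=5, sigma=0.01,
                          n_aff=3, seed=0):
    """Generate n_new synthetic points around the minority-class points.

    Simplified illustration only; parameter names are hypothetical.
    """
    rng = random.Random(seed)
    dim = len(minority[0])
    synthetic = []
    for i in range(n_new):
        # Cycle through the rare cells so each one serves as an anchor.
        anchor = minority[i % len(minority)]
        # Neighbourhood: the k minority points closest to the anchor
        # (the anchor itself is included, at distance 0).
        neighbourhood = sorted(minority, key=lambda p: math.dist(p, anchor))[:k]
        # Shadow samples: noisy copies of every point in the neighbourhood.
        shadows = [
            [x + rng.gauss(0.0, sigma) for x in point]
            for point in neighbourhood
            for _ in range(n_shadow)
        ]
        # Random affine combination: non-negative weights that sum to 1.
        chosen = rng.sample(shadows, n_aff)
        raw = [rng.random() for _ in range(n_aff)]
        total = sum(raw)
        weights = [w / total for w in raw]
        synthetic.append(
            [sum(w * s[d] for w, s in zip(weights, chosen)) for d in range(dim)]
        )
    return synthetic


# Toy "expression profiles" of four rare cells over three genes (made-up values).
rare_cells = [[5.0, 0.2, 1.1], [4.8, 0.1, 1.3], [5.2, 0.3, 0.9], [5.1, 0.2, 1.0]]
synthetic_cells = loras_like_oversample(rare_cells, n_new=20)
```

Because every synthetic point is a weighted average of noisy neighbours, the new cells stay close to the region already occupied by the rare population instead of being exact duplicates of existing cells.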
The researchers demonstrate the effectiveness of their method in three independent use cases, each built from already published datasets. The first use case identifies cardiac glial cells in snRNA-Seq data (17 nuclei out of 8635). It was designed to cover a large imbalance ratio (~1 to 500) and uses only single-nuclei data. The second use case combines snRNA-Seq and scRNA-Seq data at a lower imbalance ratio (~1 to 26) in the training step, in order to test whether the algorithm can handle both capture procedures and to assess the impact of “less” rare cell types. The third use case draws on the murine data of the Allen Brain Atlas, which includes more than 1 million cells. For validation purposes only, all datasets were also analyzed with common data analysis approaches, such as the Seurat workflow.
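The imbalance figure for the first use case can be checked with a few lines of arithmetic. The helper names below are hypothetical, and the default target of a fully balanced 1:1 training set is an assumption for illustration, not a setting reported by the authors.

```python
import math


def imbalance_ratio(n_rare, n_total):
    """Number of majority cells per rare cell."""
    n_majority = n_total - n_rare
    return n_majority / n_rare


def synthetic_cells_needed(n_rare, n_total, target_ratio=1.0):
    """Synthetic rare cells required to reach target_ratio majority cells per rare cell."""
    n_majority = n_total - n_rare
    return max(0, math.ceil(n_majority / target_ratio) - n_rare)


# Use case 1: 17 cardiac glial nuclei out of 8635 nuclei in total.
ratio = imbalance_ratio(17, 8635)            # about 507, i.e. roughly 1 to 500
needed = synthetic_cells_needed(17, 8635)    # 8601 under the assumed 1:1 target
```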
Visualization of the workflow, demonstrating a step-by-step explanation for a sc-SynO analysis
(a) One or several snRNA-Seq or scRNA-Seq fastq datasets can be used as input. Here, we identify our cell population of interest and provide raw or normalized read counts for this specific population. This can be done with any single-cell analysis workflow, e.g., Seurat. (b) Further information is extracted for cluster annotation and serves as improved input for the subsequent training with sc-SynO. (c) Based on the data input, the underlying LoRAS synthetic oversampling algorithm of sc-SynO generates new cells around the original cells to increase the size of the minority sample. (d) The trained machine learning classifier is validated on the pre-annotated dataset used for training to evaluate the performance metrics of the model. The sc-SynO model is then ready to identify the learned rare-cell type in novel data. This figure was solely created by the authors.
Compared with baseline testing without oversampling, this approach identifies rare cells with a robust precision-recall balance, high accuracy, and a low false-positive rate. A practical benefit of the algorithm is that it can be readily integrated into existing workflows. The code is publicly available and can easily be adapted to identify other rare-cell types.
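A short sketch shows why precision and recall matter more than accuracy when the class of interest is this rare. It uses the cell counts of the first use case (17 rare nuclei out of 8635) and a hypothetical classifier that never predicts the rare class. The function name and the convention of returning 0.0 for undefined ratios are assumptions for illustration.

```python
def classification_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and accuracy from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    return {"precision": precision, "recall": recall,
            "f1": f1, "accuracy": accuracy}


# Hypothetical classifier that labels every nucleus as "not rare":
# all 17 rare nuclei are missed, all 8618 others are correct.
always_majority = classification_metrics(tp=0, fp=0, fn=17, tn=8618)
```

Here the accuracy is above 99% even though recall is zero, which is why a balanced precision-recall profile is the more informative criterion for rare-cell detection.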
Availability – All computational scripts can be obtained from our FairdomHub instance (https://fairdomhub.org/assays/1368). The algorithm itself is available at https://github.com/narek-davtyan/LoRAS, and the current integration for sc-SynO is on GitHub (https://github.com/COSPOV/sc-SynO).