The advancement of the next-generation sequencing technology enables mapping gene expression at the single-cell level, capable of tracking cell heterogeneity and determination of cell subpopulations using single-cell RNA sequencing (scRNA-seq). Unlike the objectives of conventional RNA-seq where differential expression analysis is the integral component, the most important goal of scRNA-seq is to identify highly variable genes across a population of cells, to account for the discrete nature of single-cell gene expression and uniqueness of sequencing library preparation protocol for single-cell sequencing. However, there is lack of generic expression variation model for different scRNA-seq data sets. Hence, the objective of this study is to develop a gene expression variation model (GEVM), utilizing the relationship between coefficient of variation (CV) and average expression level to address the over-dispersion of single-cell data, and its corresponding statistical significance to quantify the variably expressed genes (VEGs).
Researchers from the University of Texas Health Science Center at San Antonio have built a simulation framework that generated scRNA-seq data with different number of cells, model parameters, and variation levels. They implemented their GEVM and demonstrated the robustness by using a set of simulated scRNA-seq data under different conditions. They evaluated the regression robustness using root-mean-square error (RMSE) and assessed the parameter estimation process by varying initial model parameters that deviated from homogeneous cell population. They also applied the GEVM on real scRNA-seq data to test the performance under distinct cases.
Workflow of identifying significantly variably expressed genes and the following analyses for single-cell RNA-seq data
Availability – The R scripts of the algorithm and the UMI data for GSE65525 will be available from GitHub, https://github.com/hillas/scVEGs.