The rapidly decreasing cost of next-generation sequencing has made large-scale RNA-seq data widely available, enabling the analysis of gene expression variability in addition to gene expression means. Researchers from the University of Arizona present the MDSeq, a method based on the coefficient of dispersion that provides robust and computationally efficient analysis of both gene expression means and variability on RNA-seq counts. The MDSeq uses a novel reparametrization of the negative binomial distribution to provide flexible generalized linear models (GLMs) on both the mean and the dispersion. To address the challenges of analyzing large-scale RNA-seq data, the researchers introduce several new developments that together form a comprehensive toolset: it models technical excess zeros, identifies outliers efficiently, and evaluates differential expression at biologically interesting levels.
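As a rough illustration of what a mean-and-dispersion parametrization of the negative binomial can look like, the sketch below assumes the coefficient of dispersion is the variance-to-mean ratio, so that a count with mean mu and dispersion phi has variance phi * mu. The function names and the mapping to the standard (r, p) parameters are illustrative assumptions, not the API of the MDSeq package.

```python
import math


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and variance phi * mu.

    Assumed mapping to the standard (r, p) form:
        p = 1 / phi,  r = mu / (phi - 1),  valid for phi > 1.
    """
    if mu <= 0 or phi <= 1:
        raise ValueError("requires mu > 0 and phi > 1 (overdispersion)")
    r = mu / (phi - 1.0)
    p = 1.0 / phi
    # Work on the log scale to avoid overflow in the gamma functions.
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def moments(mu, phi, kmax=5000):
    """Mean and variance obtained by summing the pmf over 0..kmax-1."""
    probs = [nb_pmf(k, mu, phi) for k in range(kmax)]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var


# Hypothetical gene with mean count 20 and dispersion 3.
mean, var = moments(mu=20.0, phi=3.0)
print(mean, var, var / mean)
```

Under this assumed parametrization, the mean and the dispersion are separate parameters, so each can in principle be given its own linear predictor, which is the kind of flexibility the GLM description above refers to.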
The researchers evaluated the performance of the MDSeq on simulated data, for which the ground truth is known. The results suggest that the MDSeq often outperforms current methods for analyzing gene expression means and variability. The MDSeq was also applied to two real RNA-seq studies, in which it identified functionally relevant genes and gene pathways. Specifically, analysis of gene expression variability with the MDSeq on the GTEx human brain tissue data identified pathways associated with common neurodegenerative disorders when gene expression means were conserved.
Power to detect differential expression variability
Each simulation uses n case samples and n control samples, with varying proportions of excess zeros s and log2 fold-changes log2FC. The MDSeq often performs best when sample sizes are moderate or large. The performance of Levene’s test and heteroscedastic regression tends to deteriorate as the proportion of excess zeros s increases. Results are based on 1,000 simulations without additional covariates.
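To give a sense of why excess zeros matter for variability analysis, the sketch below assumes a simple zero-inflated mixture: with probability s a count is a technical zero, and otherwise it follows a negative binomial with mean mu and variance phi * mu. This mixture, the function names, and the parameter values are illustrative assumptions rather than the simulation design or model used by the researchers. Under these assumptions the apparent variance-to-mean ratio works out to phi + s * mu, so unmodelled zeros inflate the observed dispersion.

```python
import math


def nb_pmf(k, mu, phi):
    """Negative binomial pmf with mean mu and variance phi * mu (phi > 1)."""
    r = mu / (phi - 1.0)
    p = 1.0 / phi
    log_pmf = (
        math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
        + r * math.log(p) + k * math.log1p(-p)
    )
    return math.exp(log_pmf)


def zero_inflated_moments(mu, phi, s, kmax=5000):
    """Mean and variance of the assumed zero-inflated mixture.

    With probability s the count is an excess zero; otherwise it is
    drawn from the negative binomial above.
    """
    probs = [
        (1.0 - s) * nb_pmf(k, mu, phi) + (s if k == 0 else 0.0)
        for k in range(kmax)
    ]
    mean = sum(k * q for k, q in enumerate(probs))
    var = sum((k - mean) ** 2 * q for k, q in enumerate(probs))
    return mean, var


# Apparent variance-to-mean ratio for a hypothetical gene (mu=20, phi=3)
# at several proportions of excess zeros.
ratios = {}
for s in (0.0, 0.1, 0.3):
    mean, var = zero_inflated_moments(mu=20.0, phi=3.0, s=s)
    ratios[s] = var / mean
    print(s, mean, var, ratios[s])
```

In this toy setting the true dispersion stays at 3 while the apparent ratio grows with s, which is one plausible reason that methods ignoring excess zeros could lose power as s increases.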
Availability – The MDSeq is available in an efficient and user-friendly R package at https://github.com/zjdaye/MDSeq.