Identifying differentially expressed (DE) genes from RNA sequencing (RNAseq) studies is among the most common analyses in genomics. However, RNAseq DE analysis presents several statistical and computational challenges, including over-dispersed read counts and, in some settings, sample non-independence. Previous count-based methods rely on simple hierarchical Poisson models (e.g., negative binomial) to model independent over-dispersion, but do not account for sample non-independence due to relatedness, population structure and/or hidden confounders.
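As a small illustration of what "over-dispersed" means here, the sketch below compares plain Poisson counts with counts from a hierarchical Poisson model whose rate is gamma distributed, which is one standard way to obtain the negative binomial. The mean of 50 and dispersion of 0.2 are arbitrary illustrative values, not taken from the study, and the function name is hypothetical.

```python
import numpy as np


def gamma_poisson_counts(mean, dispersion, size, rng):
    """Draw counts from a Poisson whose rate is gamma distributed.

    Marginally this is a negative binomial with variance
    mean + dispersion * mean**2, i.e. larger than the Poisson variance.
    """
    # A gamma with shape 1/dispersion and scale mean*dispersion has
    # expectation `mean` and variance dispersion * mean**2.
    rates = rng.gamma(shape=1.0 / dispersion, scale=mean * dispersion, size=size)
    return rng.poisson(rates)


rng = np.random.default_rng(0)
pois = rng.poisson(50.0, size=100_000)
nb = gamma_poisson_counts(50.0, 0.2, 100_000, rng)

print("Poisson           mean, variance:", pois.mean(), pois.var())
print("negative binomial mean, variance:", nb.mean(), nb.var())
```

Both samples share roughly the same mean, but the gamma-mixed counts have a much larger variance. A plain Poisson model cannot represent that extra variance.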
Here, researchers from the University of Michigan present a Poisson mixed model with two random-effect terms that account for both independent over-dispersion and sample non-independence. They also developed a scalable sampling-based inference algorithm using a latent variable representation of the Poisson distribution. In simulations, they show that their method properly controls type I error and is generally more powerful than other widely used approaches, except in small samples (n<15) with other unfavorable properties (e.g., small effect sizes). The researchers also apply their method to three real data sets that contain related individuals, population stratification, or hidden confounders. The results show that the method increases power in all three data sets compared to other approaches, though the power gain is smallest in the smallest sample (n=6).
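To make the idea of two random-effect terms concrete, here is a minimal simulation sketch. It assumes one common parameterization: the log rate is an intercept plus a predictor effect, plus a random effect correlated across samples through a relatedness matrix K, plus an independent random effect, with a heritability-like parameter h2 splitting the total variance sigma2 between the two. The function name, parameter names and all numeric values are illustrative assumptions. This is not the MACAU implementation or its inference algorithm.

```python
import numpy as np


def simulate_pmm_counts(x, K, total_reads, beta, mu, sigma2, h2, rng):
    """Simulate read counts for one gene from a Poisson mixed model (sketch).

    log(rate_i) = mu + x_i * beta + g_i + e_i
    g ~ MVN(0, sigma2 * h2 * K)        # sample non-independence
    e ~ N(0, sigma2 * (1 - h2))        # independent over-dispersion
    y_i ~ Poisson(total_reads_i * rate_i)
    """
    n = len(x)
    g = rng.multivariate_normal(np.zeros(n), sigma2 * h2 * K)
    e = rng.normal(0.0, np.sqrt(sigma2 * (1.0 - h2)), size=n)
    log_rate = mu + x * beta + g + e
    return rng.poisson(total_reads * np.exp(log_rate))


rng = np.random.default_rng(1)

# Hypothetical design: 3 families of 4 siblings, relatedness 0.5 within a family.
K = np.kron(np.eye(3), 0.5 * np.ones((4, 4)) + 0.5 * np.eye(4))
n = K.shape[0]
x = rng.normal(size=n)                  # predictor of interest
total_reads = np.full(n, 1_000_000)     # sequencing depth per sample

counts = simulate_pmm_counts(
    x, K, total_reads, beta=0.5, mu=np.log(2e-5), sigma2=0.25, h2=0.3, rng=rng
)
print(counts)
```

Setting h2 close to 0 leaves only the independent term, which behaves like ordinary over-dispersion. Larger h2 makes related samples share more of their deviation from the mean.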
MACAU exhibits increased power to detect true positive DE genes across a range of simulation settings
Area under the curve (AUC) is shown as a measure of performance for MACAU (red), negative binomial (purple), Poisson (green), GEMMA (blue), and linear (cyan). Each simulation setting consists of 10 simulation replicates, and each replicate includes 10,000 simulated genes, with 1,000 DE and 9,000 non-DE. We used n = 63, 1260 = 0.0, PVE = 0.25, ? = 0.25. In (A) we increased h² while maintaining CV = 0.3, and in (B) we increased CV while maintaining h² = 0.3. Boxplots of AUC across replicates for the different methods show that (A) heritability (h²) influences the relative performance of the methods that account for sample non-independence (MACAU and GEMMA) compared to the methods that do not (negative binomial, Poisson, linear); (B) variation in total read counts across individuals, measured by the coefficient of variation (CV), influences the relative performance of GEMMA and negative binomial. Insets in the two panels show the rank of the different methods, where the top row represents the highest rank.
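For readers unfamiliar with AUC in this setting, the sketch below shows one way to compute it for a single simulation replicate: genes are ranked by p-value, and the AUC is the probability that a randomly chosen DE gene ranks ahead of a randomly chosen non-DE gene. The p-values are synthetic placeholders drawn from arbitrary distributions, not output from any of the methods compared above. The function names are hypothetical, and only the 1,000 DE / 9,000 non-DE split follows the caption.

```python
import numpy as np


def midranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    _, inverse, counts = np.unique(values, return_inverse=True, return_counts=True)
    # Last rank in each tie group minus half the group width gives the midrank.
    group_rank = np.cumsum(counts) - (counts - 1) / 2.0
    return group_rank[inverse]


def auc_from_pvalues(pvalues, is_de):
    """AUC for separating DE from non-DE genes; smaller p-values rank higher."""
    scores = -np.asarray(pvalues, dtype=float)   # higher score = stronger evidence
    is_de = np.asarray(is_de, dtype=bool)
    n_pos = int(is_de.sum())
    n_neg = int((~is_de).sum())
    ranks = midranks(scores)
    # Mann-Whitney U statistic of the DE group, scaled to [0, 1].
    u = ranks[is_de].sum() - n_pos * (n_pos + 1) / 2.0
    return u / (n_pos * n_neg)


rng = np.random.default_rng(2)
is_de = np.r_[np.ones(1_000, dtype=bool), np.zeros(9_000, dtype=bool)]
pvalues = np.r_[
    rng.beta(0.2, 1.0, size=1_000),   # synthetic DE genes: p-values skewed small
    rng.uniform(size=9_000),          # synthetic non-DE genes: uniform p-values
]
print("AUC for this synthetic replicate:", auc_from_pvalues(pvalues, is_de))
```

Repeating this over 10 replicates per setting and per method would give the distributions summarized by the boxplots.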
Availability – This method is implemented in MACAU, freely available at www.xzlab.org/software.html.