apeglm – approximate posterior estimation for general linear model

In RNA-seq differential expression analysis, investigators aim to detect those genes with changes in expression level across conditions, despite technical and biological variability in the observations. A common task is to accurately estimate the effect size, often in terms of a logarithmic fold change (LFC).

When the read counts are low or highly variable, the maximum likelihood estimates for the LFCs has high variance, leading to large estimates not representative of true differences, and poor ranking of genes by effect size. One approach is to introduce filtering thresholds and pseudocounts to exclude or moderate estimated LFCs. Filtering may result in a loss of genes from the analysis with true differences in expression, while pseudocounts provide a limited solution that must be adapted per dataset.

Researchers from the University of North Carolina-Chapel Hill propose the use of a heavy-tailed Cauchy prior distribution for effect sizes, which avoids the use of filter thresholds or pseudocounts. The proposed method, Approximate Posterior Estimation for GLM, apeglm, has lower bias than previously proposed shrinkage estimators, while still reducing variance for those genes with little information for statistical inference.

Effects of estimating A when counts are very low

rna-seq

On the x-axis is the true value of A across simulations, and on the y-axis is the estimate ˆ A using the method implemented in apeglm .  The boxes denote the result of 10 iterations, and the blue dot denotes the true value A .  In the Methods, the estimated standard errors are treated as known in order to estimate the scale of a Normal distribution from which the true LFCs may have originated.  We simulated n=5 vs 5 Poisson counts,  with a baseline rate λ either set to 1,  5,  or 10.  We then simulated a LFC between  the  two  groups,  and  1,000  such  “genes”  were  simulated.   The  LFC  was  drawn  from  a Normal  distribution  with  mean  0  and  variance A ,  with A ∈ [0 . 25 , 0 . 5 , 0 . 75 , 1].   Our  estimate ˆ A using the formula proposed by Efron and Morris [1] assuming known variances had negative bias for λ = 1 (although it roughly tracks the trend), but performed well for λ = 5 or λ = 10, where it only slightly underestimated A . 

Availability: The apeglm package is available as an R/Bioconductor package at https://bioconductor.org/packages/apeglm, and the methods can be called from within the DESeq2 software.

Zhu A, Ibrahim JG, Love MI. (2018) Heavy-tailed prior distributions for sequence count data: removing the noise and preserving large differences. Bioinformatics [Epub ahead of print]. [abstract]

Leave a Reply

Your email address will not be published. Required fields are marked *

*

Time limit is exhausted. Please reload CAPTCHA.