Researchers from the Ulsan National Institute of Science and Technology show, through mathematical inference and tests on a number of simulated and real RNA-seq datasets, that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias). They demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some genetically identical replicates such as cell lines or inbred animals), whereas many biological replicate datasets from unrelated samples do not suffer from such a bias, except for genes with small counts. They also show that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not.

**Effect of gene dispersion on the read count bias**

*a For a given fold-change (f = 1.3, 2, 4-fold) and dispersion value (alpha = 0, 0.001, 0.01, 0.1 and 0.3), the SNR for each read count ($\mu_1$) was depicted based on equation (1). b SNR distributions of simulated genes for different dispersion values (alpha). Mean read counts were sampled from a high-depth dataset (TCGA KIRC).*

Yoon S, Nam D. (2017) **Gene dispersion is the key determinant of the read count bias in differential expression analysis of RNA-seq data**. *BMC Genomics* 18(1):408. [article]
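The role of dispersion in the read count bias can be illustrated with a small numerical sketch. Equation (1) of the paper is not reproduced in this summary, so the code below assumes a signal-to-noise ratio of the form |μ1 − μ2| / (σ1 + σ2) together with the usual negative binomial variance μ + alpha·μ². The function names `nb_snr` and `snr_limit` are hypothetical and chosen only for illustration; the dispersion values are the ones listed in the figure caption.

```python
import math


def nb_snr(mu1: float, fold: float, alpha: float) -> float:
    """SNR between two conditions, assuming NB variance mu + alpha * mu**2.

    mu1   -- mean read count in condition 1
    fold  -- fold-change, so the mean in condition 2 is fold * mu1
    alpha -- dispersion coefficient (alpha = 0 reduces to Poisson)
    """
    mu2 = fold * mu1
    sd1 = math.sqrt(mu1 + alpha * mu1 ** 2)
    sd2 = math.sqrt(mu2 + alpha * mu2 ** 2)
    return abs(mu2 - mu1) / (sd1 + sd2)


def snr_limit(fold: float, alpha: float) -> float:
    """Large-count limit of nb_snr for alpha > 0 (the SNR stops growing)."""
    return abs(fold - 1.0) / ((1.0 + fold) * math.sqrt(alpha))


# Tabulate the SNR for a 2-fold change across read counts and dispersions.
for alpha in (0.0, 0.001, 0.01, 0.1, 0.3):
    row = [round(nb_snr(mu, 2.0, alpha), 2) for mu in (10, 100, 1000, 10000)]
    print(f"alpha={alpha}: {row}")
```

Under these assumptions, the SNR with alpha = 0 grows with the square root of the read count, so highly expressed genes are favored. With alpha > 0 the SNR levels off at the value returned by `snr_limit`, which is consistent with the summary's point that the bias is mostly confined to data with small dispersions.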

When the sample size is small or the expression levels of a gene are highly dispersed, negative binomial (NB) regression shows inflated Type-I error rates, whereas classical logistic and Bayes logistic (BL) regressions are conservative. Firth’s logistic (FL) regression performs well or is slightly conservative. Large sample sizes and low dispersion generally bring the Type-I error rates of all methods close to the nominal alpha levels of 0.05 and 0.01. Type-I error rates are nevertheless controlled after applying the data adaptive method. The NB, BL, and FL regressions gain power with larger sample size, larger log2 fold-change, and lower dispersion. The FL regression has power comparable to that of NB regression.

**Empirical power of covariate models**

*Empirical power of covariate models from a balanced design with $N_{D=1}$ = 10 and $\mu_{D=0}$ = 1000. The power of Negative Binomial with true dispersion (NB) and Firth’s Logistic (FL) regressions at significance levels 0.05 and 0.01 is shown in the figure. Black dotted horizontal lines represent 95 and 90% power. The odds ratios between covariates and case–control status (CovOR = 1.2 and 5) are partitioned by vertical black dotted lines. The numbers of covariates in the model (0, 1, 2, 3, 5 and, in some settings, 10) are positioned within each CovOR. Dotted lines within each symbol represent the 95% confidence interval. a Balanced design with $N_{D=1}$ = 10, $\mu_{D=0}$ = 1000, dispersion = 0.01, and log2fc = 0.3. b Balanced design with $N_{D=1}$ = 25, $\mu_{D=0}$ = 1000, dispersion = 1, and log2fc = 2.*

The researchers conclude that implementing the data adaptive method appropriately controls Type-I error rates in RNA-Seq analysis. Firth’s logistic regression provides a concise statistical inference process and reduces spurious associations from inaccurately estimated dispersion parameters in the negative binomial framework.

Choi SH, Labadorf AT, Myers RH, Lunetta KL, Dupuis J, DeStefano AL. (2017) **Evaluation of logistic regression models and effect of covariates for case-control study in RNA-Seq analysis**. *BMC Bioinformatics* 18(1):91. [article]
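To make Firth’s logistic regression more concrete, the following is a minimal sketch of the general technique (penalized likelihood with a Jeffreys prior, fitted by Newton steps on the modified score). It is not the authors' implementation. It assumes case-control status is the outcome and a single expression-like predictor is the covariate; the function name `firth_logistic`, the step cap, and the toy data are hypothetical and used only for illustration.

```python
import numpy as np


def firth_logistic(X, y, max_iter=100, tol=1e-8):
    """Fit a logistic regression with Firth's bias-reducing penalty.

    X -- design matrix that already contains an intercept column
    y -- 0/1 outcome vector (e.g. control = 0, case = 1)
    Returns the vector of penalized coefficient estimates.
    """
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    beta = np.zeros(X.shape[1])
    for _ in range(max_iter):
        p = 1.0 / (1.0 + np.exp(-(X @ beta)))
        w = p * (1.0 - p)
        info = X.T @ (X * w[:, None])            # Fisher information X'WX
        info_inv = np.linalg.inv(info)
        # Diagonal of the hat matrix W^(1/2) X (X'WX)^(-1) X' W^(1/2)
        h = w * np.einsum("ij,jk,ik->i", X, info_inv, X)
        # Firth's modified score: the usual score plus a leverage-based term
        score = X.T @ (y - p + h * (0.5 - p))
        step = info_inv @ score
        # Cap very large steps (an illustrative safeguard, not part of the method)
        largest = np.max(np.abs(step))
        if largest > 5.0:
            step *= 5.0 / largest
        beta = beta + step
        if np.max(np.abs(step)) < tol:
            break
    return beta


# Toy data with perfect separation: ordinary maximum likelihood has no finite
# slope estimate here, while the penalized fit remains finite.
x = np.array([-2.0, -1.0, -0.5, 0.5, 1.0, 2.0])
y = np.array([0, 0, 0, 1, 1, 1])
X = np.column_stack([np.ones_like(x), x])
beta_hat = firth_logistic(X, y)
print(beta_hat)
```

The toy example uses a tiny, perfectly separated sample because small samples are where the summary above reports the largest differences between methods. A real analysis would also need standard errors or penalized likelihood ratio tests, which this sketch omits.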