In differential expression analysis of RNA-sequencing (RNA-seq) read count data for two sample groups, highly expressed genes (or longer genes) are known to be more likely to be called differentially expressed, a phenomenon known as read count bias (or gene length bias). This bias has a great effect on downstream Gene Ontology over-representation analysis. However, it has not been systematically analyzed across the different replicate types of RNA-seq data.
Researchers from the Ulsan National Institute of Science and Technology show, by mathematical inference and by tests on a number of simulated and real RNA-seq datasets, that the dispersion coefficient of a gene in the negative binomial modeling of read counts is the critical determinant of the read count bias (and gene length bias). They demonstrate that the read count bias is mostly confined to data with small gene dispersions (e.g., technical replicates and some genetically identical replicates such as cell lines or inbred animals), whereas many biological replicate datasets from unrelated samples do not suffer from such a bias, except for genes with small counts. They also show that the sample-permuting GSEA method yields a considerable number of false positives caused by the read count bias, while the preranked method does not.
Effect of gene dispersion on the read count bias
(a) For each fold change (f = 1.3, 2 and 4) and dispersion value (alpha = 0, 0.001, 0.01, 0.1 and 0.3), the signal-to-noise ratio (SNR) is plotted against the read count (μ1) according to equation (1). (b) SNR distributions of simulated genes for different dispersion values (alpha). Mean read counts were sampled from a high-depth dataset (TCGA KIRC).
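The role of the dispersion can be illustrated with a short calculation. Equation (1) is not reproduced in this summary, so the sketch below rests on two assumptions: the negative binomial variance is mu + alpha * mu^2, and the SNR takes the GSEA-style form |mu2 - mu1| / (sigma1 + sigma2) with mu2 = f * mu1. The function names are hypothetical and this is not the authors' code. Under these assumptions, the SNR keeps growing with the read count when alpha = 0 (the Poisson case), but levels off near (f - 1) / (sqrt(alpha) * (1 + f)) when alpha > 0, which is consistent with the bias being confined to low-dispersion data.

```python
import math

def nb_std(mu, alpha):
    """Standard deviation under the assumed NB variance mu + alpha * mu^2."""
    return math.sqrt(mu + alpha * mu * mu)

def snr(mu1, fold, alpha):
    """Assumed GSEA-style signal-to-noise ratio for two groups with means
    mu1 and fold * mu1 (may differ from the paper's equation (1))."""
    mu2 = fold * mu1
    return abs(mu2 - mu1) / (nb_std(mu1, alpha) + nb_std(mu2, alpha))

def snr_limit(fold, alpha):
    """Large-count limit of snr() for alpha > 0 under the same assumptions."""
    return abs(fold - 1) / (math.sqrt(alpha) * (1 + fold))

if __name__ == "__main__":
    counts = (10, 100, 1000, 10000)
    for alpha in (0.0, 0.001, 0.01, 0.1, 0.3):
        # One row per dispersion value: SNR at increasing read counts, f = 2
        row = [round(snr(mu, 2.0, alpha), 2) for mu in counts]
        print(f"alpha={alpha}: {row}")
```

With alpha = 0 each hundredfold increase in read count multiplies the SNR by ten, whereas with alpha = 0.3 the values at 1,000 and 10,000 reads are nearly identical.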