Gene expression analysis – the normal data distribution assumption may not be the correct one

A team led by researchers at the National Heart, Lung, and Blood Institute performed RNA-Seq on 726 individual flies from the Drosophila Genetic Reference Panel, with the goal of identifying the optimal analysis approach for detecting differential gene expression among single flies. Using this data set, the research team evaluated three filtering strategies, eight normalization methods, and two statistical approaches.

They found that the most critical considerations for the analysis of RNA-Seq read count data were the normalization method, the underlying data distribution assumption, and the number of biological replicates, an observation consistent with previous RNA-Seq and microarray analysis comparisons. Some common normalization methods, such as Total Count, Quantile, and RPKM normalization, did not align the data across samples. Furthermore, analyses using the Median, Quantile, and Trimmed Mean of M-values normalization methods were sensitive to the removal of low-expressed genes from the data set. Although the normal data distribution assumption is robust in many types of analysis, it produced results vastly different from those obtained under the negative binomial distribution. In addition, at least three biological replicates per condition were required to have sufficient statistical power to detect expression differences for the three-way interaction of genotype, environment, and sex.
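One reason the distribution assumption can matter so much is that read counts are typically overdispersed: their variance grows faster than their mean. The following is a minimal sketch of that idea using simulated data, not the authors' analysis. The mean `mu` and dispersion `alpha` are hypothetical values chosen for illustration, and the sketch uses the standard negative binomial variance formula, variance = mu + alpha * mu^2.

```python
import numpy as np

rng = np.random.default_rng(0)

mu = 100.0    # hypothetical mean read count for one gene (assumption)
alpha = 0.5   # hypothetical dispersion parameter (assumption)

# NumPy parameterizes the negative binomial by (n, p),
# so convert from the (mean, dispersion) form.
n = 1.0 / alpha
p = n / (n + mu)

nb_counts = rng.negative_binomial(n, p, size=200_000)
poisson_counts = rng.poisson(mu, size=200_000)

# Theoretical negative binomial variance: mu + alpha * mu^2.
expected_nb_var = mu + alpha * mu**2

print("negative binomial mean/variance:", nb_counts.mean(), nb_counts.var())
print("Poisson mean/variance:          ", poisson_counts.mean(), poisson_counts.var())
print("expected NB variance:           ", expected_nb_var)
```

Both simulated samples share the same mean, but the negative binomial sample is far more variable. A model that assumes a fixed, mean-independent variance would describe such counts poorly, which is one plausible source of the discrepancy reported above.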

Examples of differences observed in normalization methods.


(a) Boxplots of individual RAL-320 males of Environment 2. (b) Boxplots of the coefficient of variation for RAL-900 females of Environment 3. (c) Boxplots of the coefficient of variation for RAL-900 males of Environment 3.

The research team concluded that the best analysis approach to their data was to normalize the read counts using the DESeq method and apply a generalized linear model assuming a negative binomial distribution using either edgeR or DESeq software. Genes having very low read counts were removed after normalizing the data and fitting it to the negative binomial distribution.
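To make the DESeq normalization step more concrete, here is a minimal sketch of the median-of-ratios idea behind DESeq size factors, written with NumPy. It is an illustration under stated assumptions, not the DESeq software itself or the authors' pipeline: the function name `median_of_ratios_size_factors` and the toy count matrix are hypothetical, and genes are assumed to be in rows with samples in columns.

```python
import numpy as np

def median_of_ratios_size_factors(counts):
    """Estimate per-sample size factors by the median-of-ratios method.

    counts: 2-D array of raw read counts, genes in rows, samples in columns.
    """
    counts = np.asarray(counts, dtype=float)
    with np.errstate(divide="ignore"):
        log_counts = np.log(counts)  # zero counts become -inf

    # Log of each gene's geometric mean across samples.
    log_geo_means = log_counts.mean(axis=1)

    # Only genes with a nonzero count in every sample are usable.
    usable = np.isfinite(log_geo_means)
    if not usable.any():
        raise ValueError("no gene has nonzero counts in every sample")

    # Size factor = median, over usable genes, of count / geometric mean.
    log_ratios = log_counts[usable] - log_geo_means[usable, np.newaxis]
    return np.exp(np.median(log_ratios, axis=0))

# Toy data (hypothetical): sample 2 was sequenced twice as deeply as sample 1.
counts = np.array([
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 5],   # contains a zero, so it is ignored when estimating size factors
])

size_factors = median_of_ratios_size_factors(counts)
normalized = counts / size_factors  # each column divided by its size factor

print(size_factors)
print(normalized)
```

In this toy example, dividing by the size factors brings the first three genes to the same value in both samples, which is the sense in which a normalization method aligns data across samples. In practice, the DESeq or edgeR packages would be used for both normalization and the negative binomial model fit.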

Lin Y, Golovnina K, Chen ZX, Lee HN, Negron YL, Sultana H, Oliver B, Harbison ST. (2016) Comparison of normalization and differential expression analyses using RNA-Seq data from 726 individual Drosophila melanogaster. BMC Genomics 17(1):28.
