Numerous statistical pipelines are now available for the differential analysis of gene expression measured with RNA-sequencing technology. After normalization, most of them rely on similar statistical frameworks and differ primarily in the choice of data distribution, the strategy for estimating the mean and variance, and data filtering. Researchers from the CNRS and INRA evaluate the impact of these choices when few biological replicates are available, using synthetic data sets. This framework is based on real data sets and allows the exploration of various scenarios that differ in the proportion of non-differentially expressed genes. It therefore provides an evaluation of the key ingredients of the differential analysis that is free of the biases associated with simulating data from parametric models. Their results show the importance of properly modeling the mean with linear or generalized linear models. Once the mean is properly modeled, the other parameters have a much smaller impact on the performance of the test. Finally, they propose a simple visualization of the raw P-value histogram as a practical criterion for evaluating the performance of differential analysis methods on real data sets.
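The raw P-value histogram criterion can be illustrated with a minimal Python sketch. It rests on a standard fact: P-values of genes under H0 are uniform on [0, 1] when the test is well calibrated, so the histogram should be flat apart from a peak near zero caused by differentially expressed genes. The helper names (`pvalue_histogram`, `render`) and the simulated P-values are assumptions made for illustration; they are not taken from the study.

```python
import random


def pvalue_histogram(pvalues, n_bins=20):
    """Count P-values in n_bins equal-width bins covering [0, 1]."""
    if n_bins < 1:
        raise ValueError("n_bins must be at least 1")
    counts = [0] * n_bins
    for p in pvalues:
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"P-value out of range: {p}")
        # A P-value of exactly 1.0 is assigned to the last bin.
        counts[min(int(p * n_bins), n_bins - 1)] += 1
    return counts


def render(counts, width=40):
    """Return a plain-text histogram, one line per bin."""
    peak = max(counts) or 1
    n_bins = len(counts)
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        lines.append(f"{i / n_bins:.2f}-{(i + 1) / n_bins:.2f} {bar} {c}")
    return "\n".join(lines)


# Simulated (not real) P-values: 80% of genes under H0 (uniform) and
# 20% differentially expressed genes (concentrated near zero).
rng = random.Random(0)
null_p = [rng.random() for _ in range(8000)]
de_p = [rng.betavariate(0.3, 4.0) for _ in range(2000)]

counts = pvalue_histogram(null_p + de_p)
print(render(counts))
```

A histogram that slopes upward toward 1 or shows a bump in the middle suggests that the P-values of the non-differentially expressed genes are not uniform, which is the kind of problem this visual check is meant to reveal.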
Impact of the key ingredients of the differential analysis on the calculation of the raw P-values
The medians of the Kolmogorov–Smirnov test statistics for the set of methods covering the various parameters of the differential analysis of RNA-seq data are shown for synthetic data sets containing 10% (0.1) to 90% (0.9) of genes from the full H0 data set. These data sets therefore contain many (0.1) to few (0.9) differentially expressed (DE) genes. Solid lines correspond to methods without a filtering step and dashed lines to the same methods with a filtering step.
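As a rough sketch of how such a summary could be computed, the Python code below calculates a one-sample Kolmogorov–Smirnov statistic and takes its median over several methods. It assumes that the statistic compares the raw P-values of genes from the full H0 data set with the Uniform(0, 1) distribution, which the caption does not state explicitly. The function name, the method names and the P-values are hypothetical.

```python
from statistics import median


def ks_uniform_statistic(pvalues):
    """One-sample KS statistic of the P-values against Uniform(0, 1)."""
    xs = sorted(pvalues)
    n = len(xs)
    if n == 0:
        raise ValueError("at least one P-value is required")
    d = 0.0
    for i, x in enumerate(xs):
        # The empirical CDF jumps from i/n to (i+1)/n at x, while the
        # Uniform(0, 1) CDF equals x; keep the largest gap on either side.
        d = max(d, (i + 1) / n - x, x - i / n)
    return d


# Hypothetical P-values of non-DE genes for three made-up methods.
n = 1000
grid = [(i + 0.5) / n for i in range(n)]
pvalues_by_method = {
    "method_a": grid,                      # close to uniform
    "method_b": [p ** 2 for p in grid],    # too many small P-values
    "method_c": [p ** 0.5 for p in grid],  # too many large P-values
}

ks_by_method = {
    name: ks_uniform_statistic(p) for name, p in pvalues_by_method.items()
}
median_ks = median(ks_by_method.values())
print(ks_by_method, median_ks)
```

A statistic close to zero indicates P-values that are close to uniform under H0; larger values indicate a departure from uniformity in either direction.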