RNA-sequencing (RNA-seq) has rapidly become a popular tool for characterizing transcriptomes. A fundamental research problem in many RNA-seq studies is the identification of reliable molecular markers that show differential expression between distinct sample groups. As RNA-seq has grown in popularity, a number of data analysis methods and pipelines have been developed for this task. However, there is not yet a clear consensus on best practices, which makes choosing an appropriate method a daunting task, especially for a basic user without a strong statistical or computational background.
To assist with this choice, researchers from the University of Turku, Finland, performed a systematic comparison of eight widely used software packages and pipelines for detecting differential expression between sample groups in a practical research setting, and provide general guidelines for choosing a robust pipeline. Overall, the results demonstrate how markedly the choice of data analysis tool can affect the outcome of the analysis, highlighting the importance of this choice.
Software packages for detecting differential expression
| Method | Version | Reference | Normalization (a) | Read count distribution assumption | Differential expression test |
|---|---|---|---|---|---|
| edgeR | 3.0.8 | | TMM / Upper quartile / RLE (DESeq-like) / None (all scaling factors are set to one) | Negative binomial distribution | Exact test |
| DESeq | 1.10.1 | | DESeq sizeFactors | Negative binomial distribution | Exact test |
| baySeq | 1.12.0 | | Scaling factors (quantile / TMM / total) | Negative binomial distribution | Assesses the posterior probabilities of models for differentially and non-differentially expressed genes via empirical Bayesian methods and then compares these posterior likelihoods |
| NOIseq | 1.1.4 | | RPKM / TMM / Upper quartile | Nonparametric method | Contrasts fold changes and absolute differences within a condition to determine the null distribution and then compares the observed differences to this null |
| SAMseq (samr) | 2.0 | | SAMseq specialized method based on the mean read count over the null features of the data set | Nonparametric method | Wilcoxon rank statistic and a resampling strategy |
| Limma | 3.14.4 | | TMM | voom transformation of counts | Empirical Bayes method |
| Cuffdiff 2 (Cufflinks) | 2.0.2-beta | | Geometric (DESeq-like) / quartile / classic-fpkm | Beta negative binomial distribution | t-test |
| EBSeq | 1.1.7 | | DESeq median normalization | Negative binomial distribution | Evaluates the posterior probability of differentially and non-differentially expressed entities (genes or isoforms) via empirical Bayesian methods |

(a) In case several normalization methods are available, the default one is underlined.
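Several rows in the table refer to DESeq-style normalization ("DESeq sizeFactors", "RLE (DESeq-like)", "Geometric (DESeq-like)", "DESeq median normalization"). As a rough illustration of the idea, the sketch below computes median-of-ratios size factors in plain Python. It is a minimal sketch and not the code of any of the listed packages: the function name `median_of_ratios_size_factors` and the toy counts are assumptions made for this example, and genes with a zero count in any sample are simply skipped.

```python
import math
from statistics import median


def median_of_ratios_size_factors(counts):
    """Estimate one size factor per sample.

    counts: list of per-gene rows, each a list of per-sample read counts.
    (Hypothetical helper for illustration; not a package API.)
    """
    # Keep genes with a nonzero count in every sample so the logs are finite.
    expressed = [row for row in counts if all(c > 0 for c in row)]
    if not expressed:
        raise ValueError("no gene has nonzero counts in every sample")

    n_samples = len(expressed[0])
    log_ratios = [[] for _ in range(n_samples)]
    for row in expressed:
        logs = [math.log(c) for c in row]
        # Log of the gene's geometric mean across samples (pseudo-reference).
        log_reference = sum(logs) / len(logs)
        for j, value in enumerate(logs):
            log_ratios[j].append(value - log_reference)

    # Size factor = median over genes of (count / pseudo-reference).
    return [math.exp(median(ratios)) for ratios in log_ratios]


# Toy data (made up): sample 2 is sample 1 sequenced at twice the depth.
toy_counts = [
    [10, 20],
    [100, 200],
    [5, 10],
    [0, 3],  # skipped: zero count in one sample
]
size_factors = median_of_ratios_size_factors(toy_counts)

# Dividing each count by its sample's size factor puts samples on a common scale.
normalized = [[c / s for c, s in zip(row, size_factors)] for row in toy_counts]
```

In this toy case the two size factors differ by a factor of two, so the first three genes end up with equal normalized counts in both samples, which is the behavior expected when the only difference between samples is sequencing depth.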