RNA sequencing analysis methods are often derived by relying on hypothetical parametric models for read counts that are not likely to be precisely satisfied in practice. Methods are often tested by analyzing data that have been simulated according to the assumed model. This testing strategy can result in an overly optimistic view of the performance of an RNA-seq analysis method.
Researchers at Iowa State University have developed a data-based simulation algorithm for RNA-seq data. The vector of read counts simulated for a given experimental unit has a joint distribution that closely matches the distribution of a source RNA-seq dataset provided by the user. They conduct two simulation experiments, one based on the negative binomial distribution and one based on their proposed nonparametric simulation algorithm, and compare the performance of a small subset of the statistical methods for RNA-seq analysis available in the literature across the two experiments. The benchmark is the ability of a method to control the false discovery rate (FDR). Not surprisingly, methods based on parametric modeling assumptions seem to perform better with respect to FDR control when data are simulated from parametric models than when they are simulated using the more realistic nonparametric strategy.
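To make the idea of data-based simulation concrete, the following is a minimal sketch of one plausible resampling scheme, written in Python rather than R. It is not the SimSeq implementation, and the function names (`resample_counts`, `false_discovery_proportion`), the two-group layout, and the choice of which genes are differentially expressed are assumptions made for illustration. Whole sample columns are drawn without replacement from a source count matrix, so counts are never generated from a parametric model; genes designated as differentially expressed take their second-group values from the other source treatment group.

```python
import random


def resample_counts(source, group, n, de_genes, seed=0):
    """Simulate a two-group count matrix by resampling source columns.

    source   : list of gene rows, each a list of read counts per sample
    group    : list of 0/1 treatment labels, one per source sample
    n        : number of simulated samples per group
    de_genes : indices of genes to simulate as differentially expressed
    """
    rng = random.Random(seed)
    cols0 = [j for j, g in enumerate(group) if g == 0]
    cols1 = [j for j, g in enumerate(group) if g == 1]
    if len(cols0) < 2 * n or len(cols1) < n:
        raise ValueError("source dataset has too few samples")

    # Draw columns without replacement so no source sample is reused.
    pick0 = rng.sample(cols0, 2 * n)
    pick1 = rng.sample(cols1, n)
    first_cols, null_cols = pick0[:n], pick0[n:]

    de = set(de_genes)
    simulated = []
    for i, row in enumerate(source):
        first = [row[j] for j in first_cols]
        # Null genes: both simulated groups come from source group 0.
        # DE genes: the second simulated group comes from source group 1.
        second_src = pick1 if i in de else null_cols
        simulated.append(first + [row[j] for j in second_src])
    labels = [0] * n + [1] * n
    return simulated, labels


def false_discovery_proportion(rejected, de_genes):
    """Fraction of rejected genes that are not truly DE (0 if none rejected)."""
    rejected = set(rejected)
    if not rejected:
        return 0.0
    return len(rejected - set(de_genes)) / len(rejected)
```

Because the set of differentially expressed genes is known by construction, the false discovery proportion of any analysis method can be computed for each simulated dataset, and averaging it over many simulated datasets estimates that method's FDR.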
Availability: The nonparametric simulation algorithm developed in this paper is implemented in the R package SimSeq, which is freely available from the Comprehensive R Archive Network (CRAN): http://cran.r-project.org/