More realistic simulation techniques for RNA-seq data

With the explosion in the number of methods designed to analyze bulk and single-cell RNA-seq data, there is a growing need for approaches that assess and compare these methods. The usual technique is to compare methods on data simulated according to some theoretical model. However, because real data often violate the assumptions of theoretical models, this can result in unsubstantiated claims about a method’s performance.

Rather than generate data from a theoretical model, researchers at American University have developed methods to add signal to real RNA-seq datasets. Since the resulting simulated data are not generated from an unrealistic theoretical model, they exhibit realistic (annoying) attributes of real data. This lets RNA-seq methods developers assess their procedures in non-ideal (model-violating) scenarios. These procedures may be applied to both single-cell and bulk RNA-seq. The researchers show that their simulation method results in more realistic datasets and can alter the conclusions of a differential expression analysis study. They also demonstrate their approach by comparing various factor analysis techniques on RNA-seq datasets.
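To make the idea of adding signal to real counts concrete, here is a minimal sketch of binomial thinning, the technique named in the paper's title, for a two-group design. It is an illustration under stated assumptions, not the seqgendiff API: the function names `thin_count` and `add_two_group_signal`, the random half-and-half group assignment, and the use of log2 fold changes are all hypothetical choices made for this example. Each observed read is kept independently with some probability, so a gene's counts in one group are reduced relative to the other while the rest of the real data's structure is left untouched.

```python
import random


def thin_count(count, keep_prob, rng):
    """Keep each of `count` reads independently with probability `keep_prob`."""
    return sum(rng.random() < keep_prob for _ in range(count))


def add_two_group_signal(counts, log2_fold_changes, seed=0):
    """Add a two-group effect to a genes-by-samples count matrix (hypothetical helper).

    counts: list of rows, one per gene, each a list of per-sample integer counts.
    log2_fold_changes: one value per gene; 0 leaves the gene unchanged (a null gene).
    Returns (thinned_counts, group_labels).
    """
    rng = random.Random(seed)
    n_samples = len(counts[0])

    # Assumption: samples are split at random into two groups of near-equal size.
    labels = [0] * (n_samples // 2) + [1] * (n_samples - n_samples // 2)
    rng.shuffle(labels)

    thinned = []
    for gene_counts, lfc in zip(counts, log2_fold_changes):
        row = []
        for count, group in zip(gene_counts, labels):
            if lfc > 0 and group == 0:
                keep_prob = 2.0 ** (-lfc)  # thin group 0 so group 1 appears higher
            elif lfc < 0 and group == 1:
                keep_prob = 2.0 ** lfc     # thin group 1 so group 0 appears higher
            else:
                keep_prob = 1.0            # leave the count untouched
            row.append(thin_count(count, keep_prob, rng))
        thinned.append(row)
    return thinned, labels


# Toy input standing in for a real count matrix: 2 genes, 6 samples.
toy_counts = [[100, 120, 90, 110, 95, 105],
              [200, 210, 190, 205, 195, 198]]
toy_effects = [1.0, 0.0]  # gene 0 doubles in group 1; gene 1 is null
toy_thinned, toy_labels = add_two_group_signal(toy_counts, toy_effects, seed=1)
```

Because thinning only removes reads, the simulated counts never exceed the observed ones, and genes with a zero effect are returned unchanged.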

Using data simulated from a theoretical model can substantially impact the results of a study.

Principal Component Plots


First and second principal components for the GTEx dataset (left), the powsimR dataset (center), and the seqgendiff dataset (right). The first and second principal components of the powsimR dataset are very different from those of the GTEx and seqgendiff datasets.

Availability – the tools are available in the seqgendiff R package on the Comprehensive R Archive Network (CRAN).

Gerard D. (2020) Data-based RNA-seq simulations by binomial thinning. BMC Bioinformatics 21(1):206. [article]
