RNA-seq is increasingly used to study gene expression in a wide range of organisms. While it provides a great opportunity to explore genome-scale transcriptional patterns in tremendous depth, its cost can be prohibitive. Establishing the minimal sequencing depth needed for a required level of accuracy would guide cost-effective experimental design and promote the routine application of RNA-seq.
To address this issue, researchers at Cornell University selected 36 RNA-seq datasets, each with more than 20 million reads, from six widely used model organisms: Saccharomyces cerevisiae, Homo sapiens, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, and Arabidopsis thaliana, and investigated the statistical correlation between sequencing depth and the accuracy of the results. To do so, they randomly chose reads from each dataset, mapped them to the reference genomes, and analyzed the accuracy achieved at varying coverage. Their results indicated that as few as one million reads can provide the same accuracy in transcript abundance estimates (r = 0.99) as more than 30 million reads for highly expressed genes in all six species. Because many metabolically and pathologically relevant genes are highly expressed, these findings might be instructive for cost-effective experimental design in NGS-based research and may also provide useful guidance for similar research in other organisms.
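
The subsampling idea described above can be illustrated with a small simulation. The sketch below is not the researchers' pipeline: it skips read mapping entirely and starts from made-up per-gene read counts, and the function names (simulate_counts, subsample_counts, depth_correlation), the log-normal toy distribution, and the "top quartile" definition of highly expressed genes are all assumptions chosen for illustration. It draws a random subset of reads without replacement, recounts reads per gene, and computes the Pearson correlation between log-transformed counts at reduced and full depth.

```python
import math
import random


def simulate_counts(n_genes, rng):
    """Toy per-gene read counts with a long right tail (assumed log-normal)."""
    return [int(rng.lognormvariate(3, 2)) for _ in range(n_genes)]


def subsample_counts(gene_counts, n_reads, rng):
    """Draw n_reads reads at random, without replacement, and recount per gene."""
    # Represent every read by the index of the gene it came from.
    reads = [g for g, c in enumerate(gene_counts) for _ in range(c)]
    sub = [0] * len(gene_counts)
    for g in rng.sample(reads, n_reads):
        sub[g] += 1
    return sub


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def depth_correlation(full_counts, fraction, rng):
    """Correlate subsampled vs. full-depth abundance for highly expressed genes.

    'Highly expressed' is assumed here to mean the top quartile of genes
    by full-depth read count.
    """
    n_reads = int(sum(full_counts) * fraction)
    sub = subsample_counts(full_counts, n_reads, rng)
    threshold = sorted(full_counts)[3 * len(full_counts) // 4]
    top = [i for i, c in enumerate(full_counts) if c >= threshold]
    full_log = [math.log2(full_counts[i] + 1) for i in top]
    sub_log = [math.log2(sub[i] + 1) for i in top]
    return pearson(full_log, sub_log)


if __name__ == "__main__":
    rng = random.Random(0)
    full = simulate_counts(500, rng)
    for fraction in (0.01, 0.05, 0.25, 1.0):
        r = depth_correlation(full, fraction, rng)
        print(f"fraction of reads kept: {fraction:>4}  r = {r:.3f}")
```

Running the script prints one correlation per sampling fraction. With real data, the per-gene counts would come from mapping the subsampled reads to the reference genome rather than from a simulated distribution, so the numbers produced here say nothing about the depths reported in the study.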