While RNA-sequencing is a well-established technology, the sequencing depth required to detect all expressed genes is not known. If we set aside the biological overhead and meta-information, we are dealing with a classical sampling process. Such sampling processes are well known from population genetics and have been thoroughly investigated.
In this study, researchers from the University of Vienna and the Medical University of Vienna, Austria, used the Pitman Sampling Formula to model the sampling process of RNA-sequencing. In doing so, they characterize the sampling by means of two parameters that capture the combined effect of different sequencing technologies, protocols and their associated biases. They distinguish between two levels of sampling: the number of reads per gene, and the number of reads starting at each position of a specific gene. The second level allows one to evaluate the theoretical expectation of uniform coverage and how well sequencing protocols meet it. Most importantly, given a pilot sequencing experiment, they provide an estimate of the size of the underlying sampling universe and, based on these findings, evaluate an estimator for the number of newly detected genes when sequencing an additional sample of arbitrary size.
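To make the idea of "newly detected genes in an additional sample" concrete, here is a minimal Python sketch based on the standard sequential form of the two-parameter Pitman model. It is not the authors' estimator. The symbols `theta` and `sigma` and the function name `expected_new_classes` are assumptions chosen for illustration, and the parameters are taken as known rather than estimated from a pilot experiment. Under this model, after n reads covering k distinct genes, the next read hits a previously unseen gene with probability (theta + k * sigma) / (theta + n). Because that probability is linear in k, the expected number of distinct genes can be updated exactly, one read at a time.

```python
def expected_new_classes(n, k, m, theta, sigma):
    """Expected number of previously unseen classes (e.g. genes) among m
    further draws, given n draws so far that covered k distinct classes.

    Assumes the two-parameter Pitman model with known parameters
    0 <= sigma < 1 and theta > -sigma (illustrative names, not from the paper).
    """
    if not (0 <= sigma < 1 and theta > -sigma):
        raise ValueError("need 0 <= sigma < 1 and theta > -sigma")
    if n < 1 or k < 1 or k > n or m < 0:
        raise ValueError("need 1 <= k <= n and m >= 0")

    expected_k = float(k)
    for i in range(m):
        # Probability that draw number n + i + 1 is a new class, averaged
        # over the current (expected) number of distinct classes.
        expected_k += (theta + sigma * expected_k) / (theta + n + i)
    return expected_k - k


if __name__ == "__main__":
    # Hypothetical pilot: 1,000,000 reads covering 15,000 genes,
    # with made-up parameter values purely for demonstration.
    extra = expected_new_classes(1_000_000, 15_000, 500_000, theta=2000.0, sigma=0.1)
    print(f"expected newly detected genes: {extra:.1f}")
```

In practice the two parameters would first have to be fitted to the pilot data, and the numbers in the usage example are invented solely to show how the function is called.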
- Tauber S, von Haeseler A. (2013) Exploring the sampling universe of RNA-seq. Stat Appl Genet Mol Biol [Epub ahead of print].