Calculating Sample Size Estimates for RNA Sequencing Data

Because of its high technical reproducibility and a resolution orders of magnitude greater than that of microarrays, next-generation sequencing of mRNA (RNA-Seq) is quickly becoming the de facto standard for measuring gene expression levels in biological experiments. Two important questions must be considered when designing an experiment:

1) how deep does one need to sequence? and,

2) how many biological replicates are necessary to observe a significant change in expression?

After studying the gene expression distributions from 127 RNA-Seq experiments, researchers from the Mayo Clinic in Minnesota found evidence that 91% ± 4% of all annotated genes are sequenced at a frequency of 0.1 times per million bases mapped, regardless of sample source. Combining this observation with other parameters, such as the biological and technical variation that they estimated empirically from their large datasets, they developed a model to estimate the statistical power needed to identify differentially expressed genes from RNA-Seq experiments.

Their results provide a needed reference for ensuring that RNA-Seq gene expression studies are conducted with the optimal sample size, power, and sequencing depth.
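To make the relationship between depth, variation, and replicate number concrete, here is a minimal Python sketch of a sample size calculation of this kind. The text above does not state the formula, so the expression used here, n = 2 (z_{1-α/2} + z_β)² (1/μ + σ²) / (ln Δ)², is an assumption about the form of the model and should be checked against the paper and its supplemental files before use. The function name `samples_per_group` and its parameter names are hypothetical, chosen for illustration only.

```python
import math
from statistics import NormalDist


def samples_per_group(depth, cv, effect, alpha=0.05, power=0.8):
    """Estimate biological replicates needed per group (not rounded).

    Assumed formula (verify against the paper before relying on it):
        n = 2 * (z_{1-alpha/2} + z_power)^2 * (1/depth + cv^2) / ln(effect)^2

    depth  -- average read coverage of the gene of interest (technical term 1/depth)
    cv     -- biological coefficient of variation within a group
    effect -- fold change to detect (e.g. 2 for a two-fold change)
    alpha  -- two-sided type I error rate
    power  -- desired probability of detecting the effect
    """
    if depth <= 0 or cv < 0 or effect <= 0 or effect == 1:
        raise ValueError("depth must be > 0, cv >= 0, effect > 0 and != 1")
    std_normal = NormalDist()
    z_alpha = std_normal.inv_cdf(1 - alpha / 2)
    z_beta = std_normal.inv_cdf(power)
    variance_term = 1 / depth + cv ** 2
    return 2 * (z_alpha + z_beta) ** 2 * variance_term / math.log(effect) ** 2


if __name__ == "__main__":
    # Hypothetical inputs: 20x coverage, CV of 0.4, two-fold change.
    n = samples_per_group(depth=20, cv=0.4, effect=2)
    print(f"estimated replicates per group: {n:.2f} (round up to {math.ceil(n)})")
```

Under this assumed formula, the 1/depth term shrinks quickly as coverage increases, so beyond moderate depth the biological variation term dominates and adding replicates helps more than sequencing deeper. That speaks to both design questions posed at the start.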


Availability – Both R code and an Excel worksheet are available in the supplemental data file for investigators to run these calculations for their own experiments. For complex queries and advanced usage, the authors have also provided an R package available via Bioconductor.

  •  Hart SN, Therneau TM, Zhang Y, Poland GA, Kocher JP. (2013) Calculating Sample Size Estimates for RNA Sequencing Data. J Comput Biol [Epub ahead of print].