Massively parallel cDNA sequencing (RNA-seq) experiments are gradually superseding microarrays in quantitative gene expression profiling. However, many biologists are uncertain about the validity of cost-saving sample pooling strategies for their RNA-seq experiments.
Researchers from Aarhus University sequenced RNA pools and compared the results with those obtained by sequencing the corresponding individual RNA samples.
Agreement between sequencing RNA pools and sequencing the corresponding individual RNA samples. (a) Intersection between the differentially expressed genes (DEGs), detected by edgeR, in RNA-seq data from pooled RNA (3 samples/pool; two pools/group) and in data from the corresponding individual RNA samples (3 samples/group). The rectangle represents all expressed genes. (b) Correlation between the logarithmic (base 2) fold changes (LFC) in expression estimated by sequencing RNA pools (3 samples/pool) and by sequencing the corresponding individual samples (3 samples/group). (c) Intersection between the DEGs, detected by edgeR, in RNA-seq data from pooled RNA (8 samples/pool; two pools/group) and in data from the corresponding individual RNA samples (8 samples/group). The rectangle represents all expressed genes. (d) Correlation between the LFC in expression estimated by sequencing RNA pools (8 samples/pool) and by sequencing the corresponding individual samples (8 samples/group).
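The two kinds of comparison described in the figure caption, the overlap between DEG lists and the correlation between LFC estimates, can be sketched in a few lines of Python. This is only an illustration, not the authors' analysis code: the function names and the toy gene lists are hypothetical, the caption does not say which correlation coefficient was used (Pearson is assumed here), and treating the individual-sample DEG list as the reference for the positive predictive value is also an assumption.

```python
import math


def deg_overlap(pooled_degs, individual_degs):
    """Summarise the overlap between two DEG lists (the intersection panels)."""
    pooled = set(pooled_degs)
    individual = set(individual_degs)
    shared = pooled & individual
    return {
        "shared": len(shared),
        "pooled_only": len(pooled - individual),
        "individual_only": len(individual - pooled),
        # Assumption: the individual-sample DEG list is the reference, so this
        # is the fraction of pooled DEGs that the individual samples confirm.
        "ppv": len(shared) / len(pooled) if pooled else float("nan"),
    }


def pearson(xs, ys):
    """Pearson correlation between two equal-length lists of LFC values.

    Raises ZeroDivisionError if either list is constant.
    """
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length lists with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical toy inputs, not data from the study.
pooled_degs = ["geneA", "geneB", "geneC", "geneD"]
individual_degs = ["geneB", "geneC", "geneE"]
summary = deg_overlap(pooled_degs, individual_degs)

pooled_lfc = [1.2, -0.8, 2.1, 0.3]
individual_lfc = [1.0, -0.5, 1.8, 0.1]
r = pearson(pooled_lfc, individual_lfc)
```

In this toy case two of the four pooled DEGs are shared with the individual-sample list, so `summary["ppv"]` is 0.5. In a real analysis the gene lists and LFC values would come from the edgeR output for each design.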
Their results indicated that sample pooling strategies have limited utility for RNA-seq in similar setups and supported increasing the number of biological replicates instead. Pooled samples do not capture the population variation in gene expression levels, and they cannot be used to estimate within-population variation. The within-group variances of pooled samples are smaller than the true within-group variances of the individual samples, which leads to erroneously long DEG lists with low positive predictive values that limit their practical use. If researchers plan RNA pooling to save costs or because starting material is limited, they should consider stringent false discovery corrections and high-throughput validation of as many identified DEGs as possible. If the validation targets are chosen by random sampling from the list of identified DEGs, false discovery rates can be estimated cost-effectively.
Increasing the number of biological replicates added to each pool may help to minimise the pooling bias in estimating differential gene expression. Increasing the number of replicates improves the power to detect DEGs more effectively than increasing sequencing depth above 10 million reads per sample. Limiting sequencing depth to 10 million reads per sample can therefore reduce costs and help biologists to sequence more replicates. The heterogeneity of biological variance among RNA samples may be larger than the dispersion estimated by edgeR, and most contemporary RNA-seq experiments have been estimated to be under-powered by design. Hence, reducing the number of replicates by pooling will further decrease both the power and the ability to estimate within-population variation, and will increase pooling bias as well as false discovery rates (FDR).
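The point that pooled samples show smaller within-group variance than individual samples can be illustrated with a small simulation. The sketch below rests on strong simplifying assumptions that are not taken from the study: a single gene whose expression is normally distributed across individuals, pools in which every sample contributes equally (so a pool measures the mean of its members), and no technical noise. The function name and parameters are hypothetical.

```python
import random
import statistics


def simulate_variances(pool_size, n_draws=5000, mean=10.0, sd=1.0, seed=0):
    """Compare the variance of individual samples with that of pooled samples.

    Assumption: a pool measures the average expression of its members.
    Returns (variance of individuals, variance of pools).
    """
    rng = random.Random(seed)
    # Expression of one gene measured in individual samples.
    individual = [rng.gauss(mean, sd) for _ in range(n_draws)]
    # Each pool is modelled as the mean of `pool_size` independent individuals.
    pooled = [
        statistics.fmean([rng.gauss(mean, sd) for _ in range(pool_size)])
        for _ in range(n_draws)
    ]
    return statistics.variance(individual), statistics.variance(pooled)


ind_var, pool_var = simulate_variances(pool_size=3)
print(f"individual variance: {ind_var:.3f}")
print(f"pooled variance (3 samples/pool): {pool_var:.3f}")
```

Under these assumptions the variance between pools is roughly the individual variance divided by the pool size, so a test that treats pools as if they were individuals sees an artificially small within-group variance. That is consistent with the inflated DEG lists described above, although real RNA-seq counts are not Gaussian and the size of the effect in practice will differ.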