Carefully designed control experiments provide a gold standard for benchmarking new platforms, protocols and pipelines in genomics research. RNA profiling control studies frequently use the mixture design, which takes two distinct samples and combines them in known proportions to induce predictable expression changes for every gene. Current mixture experiments have low noise and simulate relatively large expression changes by comparing RNA from different tissues, making them atypical of regular experiments.
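The predictable expression changes of a mixture design can be illustrated with a minimal sketch. It assumes that expression mixes linearly on the unlogged scale and that both samples contribute comparable amounts of RNA. The function names, expression values and proportions below are hypothetical and are not taken from the study.

```python
import math


def expected_mixture_expression(expr_a, expr_b, prop_a):
    """Expected unlogged expression of one gene in a mixture.

    expr_a, expr_b: expression of the gene in pure samples A and B.
    prop_a: proportion of sample A in the mixture (sample B makes up the rest).
    Assumption: linear mixing on the unlogged scale.
    """
    if not 0.0 <= prop_a <= 1.0:
        raise ValueError("prop_a must lie between 0 and 1")
    return prop_a * expr_a + (1.0 - prop_a) * expr_b


def expected_log2_fold_change(expr_a, expr_b, prop_1, prop_2):
    """Expected log2 fold-change of one gene between two mixtures
    that contain proportions prop_1 and prop_2 of sample A."""
    mix_1 = expected_mixture_expression(expr_a, expr_b, prop_1)
    mix_2 = expected_mixture_expression(expr_a, expr_b, prop_2)
    return math.log2(mix_1 / mix_2)


# Hypothetical gene: 100 units in sample A, 400 units in sample B.
# Comparing the pure samples recovers the full four-fold difference.
pure_lfc = expected_log2_fold_change(100.0, 400.0, 1.0, 0.0)

# Comparing two intermediate mixtures gives a smaller, but still known, change.
mixed_lfc = expected_log2_fold_change(100.0, 400.0, 0.75, 0.25)
```

Because the expected change is known for every gene, the observed fold-changes from any pipeline can be compared against these values.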
To generate a more realistic RNA-sequencing control data set, researchers from the Walter and Eliza Hall Institute of Medical Research mixed two cell lines of the same cancer type in various proportions. Noise was added by independently preparing, mixing and degrading a subset of the samples. The systematic gene-expression changes induced by this design were used to benchmark different library preparation kits (standard poly-A versus total RNA with Ribo-Zero depletion) and analysis pipelines for differential gene expression, differential splicing and deconvolution analysis. With the total RNA kit, more signal was observed for introns and various RNA classes (ncRNA, snRNA, snoRNA), and variability after degradation was lower. For differential expression analysis, voom with quality weights marginally outperformed other popular methods, while for differential splicing, the DEXSeq method was found to be the most sensitive but also the most inconsistent. For sample deconvolution analysis, DeMix outperformed ISOpure convincingly.
RNA-sequencing control experiments such as this provide a valuable resource for benchmarking different sequencing protocols and data pre-processing workflows. These researchers have demonstrated that with a few extra steps, data with noise characteristics much more similar to regular RNA-sequencing experiments can be obtained.
Overview of experimental design and data quality of our mixture control experiment
A) Two lung cancer cell lines (NCI-H1975 and HCC827) were mixed in 3 different proportions on 3 separate occasions to simulate biological variability. The second replicate of each mixture was split in two and either processed normally or heat treated (incubated at 37 °C for 9 days, see Methods) to degrade the RNA to simulate variations in RNA quality. B) Mapping statistics of reads assigned to different genomic features for all replicate 2 samples (includes both intact (labels that begin R2) and degraded (labels that begin R2D) RNA samples). The percentages that could be assigned to exons, introns, intergenic regions or were unmapped are shown in different colours. C) Mapping statistics of reads assigned to different classes of RNA for all replicate 2 samples. This figure breaks down the gene-level exon reads from panel B according to NCBI’s gene type annotation. D) Multidimensional scaling plot of poly-A RNA and total RNA experiments showing similarities and dissimilarities between libraries. Distances on the plot correspond to the leading fold-change, which is the average (root-mean-square) log2 fold-change for the 500 genes most divergent between each pair of samples. Libraries are coloured by mixture proportions, where circles represent good samples and triangles represent degraded samples.
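The leading fold-change distance described for panel D can be sketched as follows. This is a minimal illustration of the definition given in the caption, not the code used to produce the figure. The function name and the toy values are hypothetical, and the inputs are assumed to be log2 expression values for the same genes in the same order.

```python
import math


def leading_log_fold_change(log_expr_x, log_expr_y, top=500):
    """Root-mean-square log2 fold-change over the `top` genes that
    differ most between two samples.

    log_expr_x, log_expr_y: log2 expression values for the same genes,
    in the same order, one list per sample.
    """
    if len(log_expr_x) != len(log_expr_y):
        raise ValueError("both samples must cover the same genes")
    if not log_expr_x:
        raise ValueError("at least one gene is required")

    # Squared log2 fold-change per gene, largest first.
    squared = sorted(
        ((x - y) ** 2 for x, y in zip(log_expr_x, log_expr_y)),
        reverse=True,
    )

    # Use fewer genes when the data set is smaller than `top`.
    k = min(top, len(squared))
    return math.sqrt(sum(squared[:k]) / k)


# Toy example with four genes, keeping only the two most divergent ones.
sample_x = [5.0, 7.0, 3.0, 9.0]
sample_y = [8.0, 3.0, 3.0, 9.0]
distance = leading_log_fold_change(sample_x, sample_y, top=2)
```

Because the gene set is chosen separately for each pair of samples, the distance emphasises the largest differences between those two libraries rather than a fixed list of genes.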