RNA sequencing (RNA-seq) is widely used for RNA quantification across environmental, biological and medical sciences; it enables the description of genome-wide patterns of expression and the deduction of regulatory interactions and networks. The aim of computational analyses is to achieve an accurate output, i.e. rigorous quantification of genes/transcripts to allow a reliable prediction of differential expression (DE), despite the variable levels of noise and biases present in sequencing data. The evaluation of sequencing quality and normalization are essential components of this process.
Researchers from the University of East Anglia investigated the discriminative power of existing approaches for quality checking of mRNA-seq data and proposed additional, quantitative quality checks. To accommodate a nested, multi-level experimental design in D. melanogaster data, they incorporated the sample layout into the analysis. They describe a normalization based on subsampling without replacement, together with an identification of DE that accounts for the experimental design, i.e. the hierarchy and amplitude of effect sizes within samples. They also evaluated the resulting DE calls against those of existing approaches. To assess the broader applicability of these methods, they applied the same series of steps to a published set of H. sapiens mRNA-seq samples.
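The idea of normalizing by subsampling without replacement can be illustrated with a minimal sketch. It is not the authors' implementation: the function name, the toy counts and the choice of the smallest library size as the common depth are all assumptions made for illustration. Each sample is reduced to the same total number of reads by drawing reads without replacement, so no gene can end up with more reads than it started with.

```python
import numpy as np


def subsample_without_replacement(counts, target_total, seed=0):
    """Draw target_total reads from one sample without replacement.

    counts: per-gene read counts for a single sample (non-negative integers).
    target_total: common library size to subsample down to (an assumption here).
    """
    counts = np.asarray(counts, dtype=np.int64)
    if target_total > counts.sum():
        raise ValueError("target_total exceeds the number of reads in the sample")
    rng = np.random.default_rng(seed)
    # Multivariate hypergeometric sampling is equivalent to picking
    # target_total individual reads without replacement and re-counting per gene.
    return rng.multivariate_hypergeometric(counts, target_total)


# Hypothetical toy samples sequenced to different depths.
sample_a = [500, 1200, 0, 300]
sample_b = [900, 2500, 10, 590]

# Assumed choice: subsample every sample to the smallest library size.
depth = min(sum(sample_a), sum(sample_b))
norm_a = subsample_without_replacement(sample_a, depth)
norm_b = subsample_without_replacement(sample_b, depth)
```

After this step both toy samples have the same total, which makes their per-gene abundances directly comparable. In practice the draw is random, so repeating it with different seeds gives an indication of how stable the subsampled expression levels are.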
Comparison of the distribution of DE obtained using the subsampling normalization and HDE, DESeq2 and edgeR
MA plots, with the x-axis showing log2 average abundance, plotted against offset fold change (OFC) with an offset of 20 (Panel A) and against fold change (FC) (Panels B and C). The example shown is for the 02HT ± rivals DE comparison. The red line indicates 0 log2 FC/OFC and the blue lines indicate ±0.5 log2 FC/OFC. Red data points represent the genes called differentially expressed by each method. Panel A shows the results for the subsampling normalization with DE calculated using the hierarchical approach, Panel B for DESeq2 and Panel C for edgeR. Panel D shows a Venn diagram of the number of differentially expressed genes identified by two or more methods versus uniquely by each.
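The effect of the offset used in Panel A can be illustrated with a short sketch. The formula below, log2 of (treated + offset) over (control + offset), is a common way to define an offset fold change and is assumed here; the exact definition used in the study may differ, and the function name and example values are hypothetical. The offset of 20 and the ±0.5 log2 threshold are taken from the caption.

```python
import math


def log2_offset_fold_change(treated, control, offset=20.0):
    """log2 fold change with a constant offset added to both abundances.

    The offset damps fold changes between low-abundance genes, where small
    absolute differences would otherwise look like large relative changes.
    """
    return math.log2((treated + offset) / (control + offset))


# Two hypothetical genes with the same plain fold change (4x, i.e. log2 FC = 2)
# but very different abundances.
low_ofc = log2_offset_fold_change(4, 1)       # low-abundance gene
high_ofc = log2_offset_fold_change(400, 100)  # high-abundance gene

threshold = 0.5  # the blue lines in the MA plots
low_exceeds = abs(low_ofc) > threshold
high_exceeds = abs(high_ofc) > threshold
```

Under these assumptions the low-abundance gene falls inside the ±0.5 band while the high-abundance gene lies well outside it, even though their plain fold changes are identical.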
The dataset-tailored methods improved sample comparability and delivered a robust prediction of subtle gene expression changes. Overall, the proposed approach offers the potential to improve key steps in the analysis of RNA-seq data by incorporating the structure and characteristics of biological experiments into the data analysis.