The reproducibility of gene expression measured by RNA sequencing (RNA-Seq) depends on sequencing depth. Unmapped and non-exonic reads do not contribute to gene expression quantification, whereas duplicate reads contribute to quantification but are not informative for reproducibility. UC Santa Cruz researchers show that mapped, exonic, non-duplicate (MEND) reads are a useful measure of reproducibility of RNA-Seq datasets used for gene expression analysis.
In bulk RNA-Seq datasets from 2,179 tumors in 48 cohorts, the fraction of reads that contribute to the reproducibility of gene expression analysis varies greatly. Unmapped reads constitute 1-77% of all reads (median [IQR], 3% [3-6%]); duplicate reads constitute 3-100% of mapped reads (median [IQR], 27% [13-43%]); and non-exonic reads constitute 4-97% of mapped, non-duplicate reads (median [IQR], 25% [16-37%]). MEND reads constitute 0-79% of total reads (median [IQR], 50% [30-61%]).
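Each percentage above uses a different denominator (total reads, mapped reads, or mapped non-duplicate reads), which is easy to misread. The following is a minimal Python sketch of that bookkeeping, not the project's own script. The `ReadCounts` class, its field names, and the sample counts are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class ReadCounts:
    """Hypothetical per-sample read counts; field names are illustrative."""
    total: int           # all sequenced reads
    mapped: int          # reads aligned to the genome
    mapped_nondup: int   # mapped reads remaining after duplicate removal
    mend: int            # mapped, exonic, non-duplicate reads


def _ratio(numerator: int, denominator: int) -> float:
    # A sample can have zero mapped or zero non-duplicate reads;
    # return NaN instead of raising ZeroDivisionError in that case.
    return numerator / denominator if denominator else float("nan")


def read_fractions(c: ReadCounts) -> dict:
    """Compute each fraction with the denominator named in the text."""
    return {
        "unmapped_of_total": _ratio(c.total - c.mapped, c.total),
        "duplicate_of_mapped": _ratio(c.mapped - c.mapped_nondup, c.mapped),
        "nonexonic_of_mapped_nondup": _ratio(c.mapped_nondup - c.mend, c.mapped_nondup),
        "mend_of_total": _ratio(c.mend, c.total),
    }


# Made-up counts for illustration only (not taken from the study).
sample = ReadCounts(total=1_000_000, mapped=950_000, mapped_nondup=700_000, mend=525_000)
fractions = read_fractions(sample)
for name, value in fractions.items():
    print(f"{name}: {value:.1%}")
```

With these made-up counts the MEND fraction is 52.5% of total reads. Within a single sample it equals the product of the three retained fractions, (1 - unmapped) x (1 - duplicate) x (1 - non-exonic). The same identity does not hold for the cohort medians, because a product of medians is not generally the median of the products.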
Not all reads in an RNA-Seq dataset are informative for the reproducibility of gene expression measurements, and the informative fraction varies between datasets. The researchers therefore propose reporting a dataset’s sequencing depth in MEND reads, which definitively inform the reproducibility of gene expression, rather than in total, mapped, or exonic reads. They provide a Docker image containing (i) the existing required tools (RSeQC, sambamba, and samblaster) and (ii) a custom script to calculate MEND reads from RNA-Seq data files. The researchers recommend that all RNA-Seq gene expression experiments, sensitivity studies, and depth recommendations use MEND units for sequencing depth.
RNA-Seq datasets include 4 main types of sequencing reads
A. Simplified schematic illustrating read types. The X axis (blue) is a genomic locus containing an exon. The other boxes each represent 1 sequencing read. Two of 5 reads are MEND reads. Other reads do not map to the genome (unmapped; orange border), map to a non-exonic region of the genome (non-exonic; green border), or are duplicates of other reads (duplicate; red border). The MEND reads (black) fit none of these categories and are most informative for determining the reproducibility of gene expression quantification. B. Schematic illustrating read type quantification. Bars representing uninformative reads are white with a colored border. For each informative fraction, the range and median (med.) are reported.
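The read types in panel A can be illustrated with a toy classifier. This is a sketch under stated assumptions, not the method used by the project's script: the `Read` tuple and `classify` function are hypothetical, the filter order follows the denominators used in the text (unmapped, then duplicate, then non-exonic), and the five example reads assume that the three non-MEND reads in the schematic are one of each type.

```python
from collections import Counter
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical per-read annotations; real pipelines derive these from alignments."""
    mapped: bool
    duplicate: bool
    exonic: bool


def classify(read: Read) -> str:
    # Assumed filter order: unmapped reads first, then duplicates among
    # mapped reads, then non-exonic reads among mapped, non-duplicate reads.
    if not read.mapped:
        return "unmapped"
    if read.duplicate:
        return "duplicate"
    if not read.exonic:
        return "non-exonic"
    return "MEND"


# Five toy reads mirroring the schematic (assumed: one read of each non-MEND type).
reads = [
    Read(mapped=False, duplicate=False, exonic=False),  # unmapped
    Read(mapped=True, duplicate=False, exonic=False),   # non-exonic
    Read(mapped=True, duplicate=True, exonic=True),     # duplicate
    Read(mapped=True, duplicate=False, exonic=True),    # MEND
    Read(mapped=True, duplicate=False, exonic=True),    # MEND
]

counts = Counter(classify(r) for r in reads)
print(dict(counts))
print(f"MEND reads: {counts['MEND']} of {len(reads)}")
```

A read is counted in only one category here, so a read that is both duplicate and non-exonic is reported as a duplicate. That tie-breaking is a consequence of the assumed filter order.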
Availability – project home page: https://github.com/UCSC-Treehouse/mend_qc