The enormous number of public gene expression data sets can turn into research gold when mined for any number of biological questions. But digging into the data of other researchers is often hampered by a lack of metadata. In particular, missing electrophoresis-based RNA quality scores make it impossible to cull poor-quality samples from analysis or to take computational steps to account for RNA degradation. Because degraded RNA samples, such as those collected from post-mortem tissues, can yield distinct and potentially biased expression profiles, quality control is a particularly important step in mining these data.
Now, researchers from Tsinghua and Columbia Universities have developed a method named mRIN to assess mRNA integrity directly from RNA-Seq data, at both the sample level and the individual gene level. They systematically analysed large-scale RNA-Seq data sets of the human brain transcriptome generated by different consortia. Their analysis demonstrates that 3′ bias resulting from partial RNA fragmentation in post-mortem tissues has a marked impact on global expression profiles, and that mRIN effectively identifies samples with different levels of mRNA degradation. Unexpectedly, mRNA degradation in these tissues has a reproducible and gene-specific component, and transcripts with different stabilities are associated with distinct functions and structural features reminiscent of mRNA decay in living cells.
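To make the idea of 3′ bias concrete, here is a minimal sketch of how a Kolmogorov-Smirnov (KS) style statistic could be computed from the read coverage profile of a single transcript. It is an illustration only, not the authors' implementation: the function name `three_prime_bias_ks`, the comparison against a uniform coverage model, and the sign convention are assumptions, and the further steps that turn per-gene statistics into a sample-level mRIN are not reproduced here.

```python
import numpy as np


def three_prime_bias_ks(coverage):
    """Signed KS-style statistic for one transcript (hypothetical helper).

    coverage: per-base read depth, ordered from the 5' end to the 3' end.
    Returns a value in [-1, 1]; positive values mean reads are shifted
    toward the 3' end relative to uniform coverage (assumed convention).
    """
    cov = np.asarray(coverage, dtype=float)
    total = cov.sum()
    if cov.size == 0 or total == 0:
        return float("nan")  # no reads: the statistic is undefined

    # Empirical cumulative distribution of reads along the transcript.
    observed = np.cumsum(cov) / total
    # Cumulative distribution expected under perfectly uniform coverage.
    uniform = np.arange(1, cov.size + 1) / cov.size

    # Where the observed curve lags the uniform one, reads sit toward the 3' end.
    diff = uniform - observed
    # Keep the sign of the largest absolute deviation.
    return float(diff[np.argmax(np.abs(diff))])
```

Under these assumptions, a transcript whose reads all fall on the last of four bases, `three_prime_bias_ks([0, 0, 0, 4])`, scores 0.75, while perfectly even coverage scores 0.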
(a) Schematic illustration of the algorithm to estimate mRIN. After estimation of the 3′ bias of each gene and sample using a KS statistic from the read coverage profile, an mRIN is calculated for each sample. A normal distribution of the mRINs of non-degraded samples is estimated using a mixture model to assess the statistical significance. (b) Global under-representation of gene expression of the BrainSpan samples as measured by RNA-Seq is associated with low mRINs. Samples in the mRIN bar plot and the heat map are in the same order. (c) Validation of mRIN as a measure of mRNA integrity by a direct comparison of the RNA-Seq and exon array data. This analysis included 479 samples whose gene expression was quantified by both RNA-Seq and exon arrays. For each sample, the correlation of gene expression estimated from RNA-Seq and that estimated from exon arrays (denoted seq–array correlation or SAC) is calculated. SAC is plotted against the mRIN of each sample (Pearson correlation R=0.58, P<2.2 × 10⁻¹⁶, F-test). (d) mRIN was used to separate 124 samples with the most severe RNA degradation (mRIN<−0.033, P<0.1, Methods) from the remaining 355 samples. For each group, the heat maps of gene expression as measured by RNA-Seq and exon arrays are shown, with genes and samples in the same order as determined by hierarchical clustering of the array data.
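The validation in panels (c) and (d) can be sketched in a few lines. The code below is a hedged illustration rather than the authors' pipeline: the function names `seq_array_correlation` and `flag_degraded` are hypothetical, the use of Pearson correlation for SAC is an assumption (the caption specifies Pearson only for the SAC-versus-mRIN comparison), and the inputs are assumed to be matched genes-by-samples matrices. Only the cutoff of −0.033 comes from the caption.

```python
import numpy as np


def seq_array_correlation(seq_expr, array_expr):
    """Per-sample correlation between RNA-Seq and array expression (SAC).

    Both inputs are genes x samples matrices with matching rows and columns.
    Pearson correlation is assumed here; a sample with constant expression
    gives NaN because its variance is zero.
    """
    seq = np.asarray(seq_expr, dtype=float)
    arr = np.asarray(array_expr, dtype=float)
    if seq.shape != arr.shape:
        raise ValueError("expression matrices must have the same shape")

    # Center each sample (column) on its mean across genes.
    seq_c = seq - seq.mean(axis=0)
    arr_c = arr - arr.mean(axis=0)

    # Column-wise Pearson correlation: covariance over product of norms.
    numerator = (seq_c * arr_c).sum(axis=0)
    denominator = np.sqrt((seq_c ** 2).sum(axis=0) * (arr_c ** 2).sum(axis=0))
    return numerator / denominator


def flag_degraded(mrin, threshold=-0.033):
    """Boolean mask of samples whose mRIN falls below the cutoff.

    The default threshold is the value quoted in the figure caption.
    """
    return np.asarray(mrin, dtype=float) < threshold
```

With SAC and mRIN in hand for each sample, `np.corrcoef(sac, mrin)[0, 1]` would give the kind of Pearson correlation reported in panel (c), and the mask from `flag_degraded` would split the samples into the two groups compared in panel (d).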