The availability of fast alignment-free algorithms has greatly reduced the computational burden of RNA-seq processing, especially for relatively poorly assembled genomes. Using these approaches, previously generated RNA-seq datasets could be reprocessed and integrated with newly sequenced libraries. Confounding factors in such integration include sequencing depth and the methods of RNA extraction and selection. Different selection methods (typically either polyA-selection or rRNA-depletion) omit different RNAs, so different fractions of the transcriptome are sequenced. In particular, rRNA-depleted libraries sample a broader fraction of the transcriptome than polyA-selected libraries. In this study, researchers from the Roslin Institute aimed to develop a systematic means of accounting for library type that allows data from these two methods to be compared.
The method was developed by comparing two RNA-seq datasets from ovine macrophages, identical except for RNA selection method. Gene-level expression estimates were obtained using a two-part process centred on the high-speed transcript quantification tool Kallisto. Firstly, a set of reference transcripts was defined that constitute a standardised RNA space, with expression from both datasets quantified against it. Secondly, a simple ratio-based correction was applied to the rRNA-depleted estimates. The outcome is an almost perfect correlation between gene expression estimates, independent of library type and across the full range of levels of expression.
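As a rough illustration of the quantification step, the sketch below collapses transcript-level TPMs into gene-level TPMs by summing them per gene. It reads the standard Kallisto `abundance.tsv` columns (`target_id`, `tpm`). The transcript-to-gene mapping `tx2gene`, the function name and the toy values are hypothetical, and summation is a common convention rather than a procedure confirmed by the study.

```python
import csv
import io
from collections import defaultdict


def gene_level_tpm(abundance_tsv, tx2gene):
    """Sum transcript-level TPMs into gene-level TPMs.

    abundance_tsv: open text handle on a Kallisto-style abundance table.
    tx2gene: dict mapping transcript ID -> gene ID (hypothetical input;
             transcripts missing from it are treated as outside the
             reference set and skipped).
    """
    gene_tpm = defaultdict(float)
    reader = csv.DictReader(abundance_tsv, delimiter="\t")
    for row in reader:
        gene = tx2gene.get(row["target_id"])
        if gene is None:
            continue  # transcript not in the chosen reference set
        gene_tpm[gene] += float(row["tpm"])
    return dict(gene_tpm)


# Toy table in the abundance.tsv layout (values are invented).
example = io.StringIO(
    "target_id\tlength\teff_length\test_counts\ttpm\n"
    "tx1\t1000\t800\t40\t10.0\n"
    "tx2\t1500\t1300\t20\t2.5\n"
    "tx3\t900\t700\t5\t1.5\n"
)
tx2gene = {"tx1": "geneA", "tx2": "geneA", "tx3": "geneB"}
gene_tpm = gene_level_tpm(example, tx2gene)
```

Quantifying both library types against the same transcript set, then aggregating with the same mapping, keeps the two sets of gene-level estimates in a shared space before any correction is applied.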
Variance in TPM estimates introduced by the differential transcriptome sampling of polyA+ and ribo-minus methods can be corrected mathematically
(Left) Differential transcriptome sampling by polyA+ and ribo-minus RNA selection methods leads to variance in TPM estimates. Each point is a gene, coloured by type: black points represent protein-coding genes, pseudogenes and processed pseudogenes (n = 21,211); blue points represent RNA genes (n = 5843). The line y = x is shown in red. (Right) The same data are shown except that all ribo-minus TPM estimates were multiplied by the ratio of the median TPM across all polyA+ libraries to the median TPM across all ribo-minus libraries. Should the median TPM across all ribo-minus libraries be 0, this ratio was considered 0 also. Each point is a gene, coloured by type: black points represent protein-coding genes, pseudogenes and processed pseudogenes; blue points represent RNA genes. The line y = x is shown in red. Pearson’s r = 0.998, p < 2.2 × 10⁻¹⁶
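The correction described in the caption can be sketched in a few lines. This is a minimal illustration, not the authors' code: it assumes the ratio is computed separately for each gene (the caption does not say so explicitly, although the zero-median rule suggests it), and the function names, data layout and TPM values are invented for the example.

```python
from statistics import median


def correction_ratios(polya, ribominus):
    """Ratio of median polyA+ TPM to median ribo-minus TPM, per gene.

    polya, ribominus: dict mapping gene ID -> list of TPMs, one per library.
    Assumption: the ratio is gene-specific. A ribo-minus median of 0 gives
    a ratio of 0, as stated in the caption.
    """
    ratios = {}
    for gene, ribo_values in ribominus.items():
        ribo_median = median(ribo_values)
        if ribo_median == 0:
            ratios[gene] = 0.0
        else:
            ratios[gene] = median(polya[gene]) / ribo_median
    return ratios


def apply_correction(ribominus, ratios):
    """Multiply every ribo-minus TPM by its gene's ratio."""
    return {
        gene: [value * ratios[gene] for value in values]
        for gene, values in ribominus.items()
    }


# Toy data: three libraries per selection method (values are invented).
polya = {
    "geneA": [10.0, 12.0, 14.0],
    "geneB": [0.0, 1.0, 2.0],
    "geneC": [5.0, 5.0, 5.0],
}
ribominus = {
    "geneA": [5.0, 6.0, 7.0],
    "geneB": [0.0, 0.0, 3.0],
    "geneC": [20.0, 10.0, 30.0],
}

ratios = correction_ratios(polya, ribominus)
corrected = apply_correction(ribominus, ratios)
```

In this toy case geneA is scaled up by a factor of 2, geneC is scaled down by a factor of 4, and geneB, whose ribo-minus median is 0, has all of its corrected values set to 0.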
A combination of reference transcriptome filtering and a ratio-based correction can create equivalent expression profiles from both polyA-selected and rRNA-depleted libraries. This approach will allow meta-analysis and integration of existing RNA-seq data into transcriptional atlas projects.