The quality of RNA sequencing data relies on specific priming by the primer used for reverse transcription (RT-primer). Non-specific annealing of the RT-primer to the RNA template (RT mispriming) can generate reads with incorrect cDNA ends and lead to misinterpretation of the data. This kind of artifact in RNA-seq-based technologies is underappreciated, and no adequate tools currently exist to remove such reads computationally from published datasets. University of Texas researchers show that mispriming can occur with as little as 2 bases of complementarity at the 3′ end of the primer, followed by intermittent regions of complementarity. They also provide a computational pipeline that identifies cDNA reads produced by RT mispriming, allowing users to filter them out of any aligned dataset. Using this pipeline, the researchers identify thousands of mispriming events in a dozen published datasets from diverse technologies, including short RNA-seq, total/mRNA-seq, HITS-CLIP and GRO-seq. They further show how RT mispriming can lead to misinterpretation of data. In addition to this computational solution for removing RT-misprimed reads, they propose an experimental solution that avoids RT mispriming altogether: performing RNA-seq with a thermostable group II intron-derived reverse transcriptase (TGIRT-seq).
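The 3′-end criterion mentioned above (as few as 2 bases of complementarity at the 3′ end of the primer) can be illustrated with a minimal Python sketch. This is not the authors' pipeline. It assumes the RT-primer is the reverse complement of the 3′ adapter, so complementarity at the primer's 3′ end corresponds to the genomic sequence just downstream of a read's 3′ end starting with the first bases of the adapter. The function name, the `min_match` parameter, and the adapter string are hypothetical, and the intermittent complementarity further into the primer is not modelled.

```python
def matches_adapter_start(downstream: str, adapter: str, min_match: int = 2) -> bool:
    """Check one candidate site for the minimal mispriming signature.

    downstream: genomic sequence (RNA-sense strand) immediately 3' of the
                read's 3' end, as extracted by the caller.
    adapter:    3' adapter sequence used in the library (assumed input).
    min_match:  number of leading adapter bases that must match; 2 follows
                the minimum reported in the text, but is an adjustable
                assumption here.
    """
    if min_match < 1:
        return False
    if len(adapter) < min_match or len(downstream) < min_match:
        return False
    # Case-insensitive comparison of the first min_match bases.
    return downstream[:min_match].upper() == adapter[:min_match].upper()


# Illustrative adapter string only; substitute the adapter from your own library.
EXAMPLE_ADAPTER = "TGGAATTC"

candidate = matches_adapter_start("TGCCAGT", EXAMPLE_ADAPTER)
```

A read flagged this way is only a candidate. A two-base match occurs by chance at roughly one site in sixteen, so a practical filter would combine this test with further evidence, such as the shape of the read pile-up.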
(A) Schematic comparing bona fide cDNA peaks with peaks from mispriming events. RNA molecules that are properly ligated and reverse transcribed through a specific RT-primer/3′ adapter interaction produce a pile-up of cDNA reads with staggered ends (left). In contrast, when the RT-primer pairs with a sequence within an RNA molecule that resembles the 3′ adapter, cDNA peaks with flush ends next to the priming site are produced. (B) Pipeline to identify sites of mispriming.
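The contrast between staggered and flush ends in panel (A) can be expressed as a simple statistic: the fraction of reads in a peak that share the most common 3′ end coordinate. The sketch below is a hedged illustration rather than the published pipeline. The function name is hypothetical, and the source does not state what fraction should count as "flush", so no threshold is built in.

```python
from collections import Counter
from typing import Iterable


def flush_end_fraction(read_ends: Iterable[int]) -> float:
    """Return the fraction of reads sharing the most common 3' end position.

    read_ends: 3' end coordinates of the reads in one peak (assumed to be
               extracted from an aligned dataset by the caller).
    A value near 1.0 suggests flush ends; lower values suggest staggered ends.
    """
    ends = list(read_ends)
    if not ends:
        return 0.0
    # most_common(1) returns [(position, count)] for the modal end position.
    _, top_count = Counter(ends).most_common(1)[0]
    return top_count / len(ends)


flush_peak = flush_end_fraction([500, 500, 500, 500])      # all reads end together
staggered_peak = flush_end_fraction([497, 498, 499, 500])  # every end differs
```

In this toy input the first peak scores 1.0 and the second 0.25. Where to draw the line between the two classes is a choice the user would have to make and validate on their own data.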
Availability – Python scripts used to identify RT mispriming events are available on GitHub (https://github.com/haridh/RT-mispriming).