3′-end poly(A)+ sequencing is an efficient and economical method for global measurement of mRNA levels and alternative poly(A) site usage. A common approach involves oligo(dT)19V reverse-transcription (RT)-based library preparation followed by high-throughput sequencing with a custom primer ending in (dT)19. In most library products, the first sequenced nucleotide reflects the bona fide poly(A) site (pA), but a substantial fraction of sequencing reads arise from various mis-priming events. These can produce incorrect pA site calls anywhere from several nucleotides downstream to several kilobases upstream of the bona fide pA site. Mis-priming can be mitigated by increasing annealing stringency (e.g. raising the temperature from 37 °C to 42 °C), but it still persists at an appreciable level (∼10%), so computational methods must be used to prevent artifactual calls.
UCLA researchers present a bioinformatics workflow for precise mapping of poly(A)+ 3′ ends and handling of artifacts due to oligo(dT) mis-priming and sample polymorphisms. The researchers test pA site calling with three different read mapping programs (STAR, BWA, and BBMap), and show that the way in which each handles terminal mismatches and soft clipping has a substantial impact on identifying correct pA sites, with BWA requiring the least post-processing to correct artifacts. They demonstrate the use of this pipeline for mapping pA sites in the model eukaryote S. cerevisiae, and further apply this technology to non-polyadenylated transcripts by employing in vitro polyadenylation prior to library prep (IVP-seq). As proof of principle, the researchers show that a fraction of tRNAs harbor CCU 3′ tails instead of the canonical CCA tail, and globally identify 3′ ends of splicing intermediates arising from inefficiently spliced transcripts.
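To make the role of soft clipping concrete, the following is a minimal Python sketch, not the RECTIFY implementation. It extracts the bases an aligner soft-clipped at the 5′ end of a read from a SAM-style CIGAR string and labels them as T-rich or not. It assumes that reads are antisense to the mRNA, so residual poly(A) tail sequence from internal priming appears as T's at the read's 5′ end. The function names and the 0.8 threshold are hypothetical and chosen only for illustration.

```python
import re

# SAM CIGAR operations: a length followed by one operation code.
CIGAR_RE = re.compile(r"(\d+)([MIDNSHP=X])")
COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def five_prime_soft_clip(cigar, seq, is_reverse):
    """Return the soft-clipped bases at the read's 5' end, in read orientation.

    cigar and seq follow SAM conventions: for reverse-strand alignments both
    are given in reference orientation, so the read's 5' end is at the end of
    the record and the bases must be reverse-complemented.
    """
    ops = [(int(n), op) for n, op in CIGAR_RE.findall(cigar)]
    # Hard-clipped bases are absent from SEQ, so ignore those operations.
    ops = [(n, op) for n, op in ops if op != "H"]
    if not ops:
        return ""
    n, op = ops[-1] if is_reverse else ops[0]
    if op != "S":
        return ""
    if is_reverse:
        return seq[-n:].translate(COMPLEMENT)[::-1]
    return seq[:n]


def classify_soft_clip(clipped, min_t_fraction=0.8):
    """Label a 5' soft clip (hypothetical categories and threshold).

    "tail-like": mostly T, consistent with priming inside the poly(A) tail.
    "other": mixed bases, e.g. terminal mismatches from a polymorphism.
    "none": the alignment has no 5' soft clip.
    """
    if not clipped:
        return "none"
    t_fraction = clipped.upper().count("T") / len(clipped)
    return "tail-like" if t_fraction >= min_t_fraction else "other"
```

Under these assumptions, a "tail-like" clip suggests that the aligned 5′ end already marks the pA site, whereas an "other" clip suggests that the aligner trimmed genomically encoded bases and the call may need to be shifted by the clip length. How each case should actually be corrected depends on the aligner.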
3′ end sequencing workflow with oligo(dT)19V
(a) Overview of the 3′ end library preparation with oligo(dT)19V reverse-transcription of poly(A)+ RNA. An oligo(dT)19V primer annealed internally in the pA tail would exhibit a mismatch between the V and A, greatly reducing the rate of RT extension. Second strand synthesis with random hexamers adds the read2 adapter. PCR amplification adds the P5/P7 adapter sequences with indexes for multiplexed sequencing. Sequencing on the flow cell proceeds with a 5′-read1-T19-3′ primer such that the first nucleotide sequenced is V (A/C/G), corresponding to the first genomically-encoded, non-adenosine ribonucleoside upstream of the pA tail. (b) Overview of the bioinformatics workflow for 3′ end sequencing analysis (single-end reads in this example). Reads are processed to remove adapters and low-quality sequence (without 5′ trimming) and mapped to the genome, allowing for 5′ soft clipping to address oligo(dT) mis-priming in the pA tail. The mapped reads are then collapsed to their 5′ ends, and reads likely arising from genomic A/(G)-rich stretches are flagged. pA site counts can then be analyzed by differential expression at the level of specific sites (nt resolution), clusters (multi-nt resolution), or gene-level features (100 bp-1 kb resolution). (c) Chart depicting the sources of potential mapping artifacts in oligo(dT)19V 3′ end seq, the characteristics of alignments arising from each artifact, and guidelines for alignment post-processing to generate the most probable 3′ end. BWA is the recommended aligner as it requires the least post-processing.
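The collapsing and flagging steps in panel (b) can be sketched in Python as follows. This is an illustrative outline under stated assumptions, not the published pipeline. It assumes alignments are simple (chrom, start, end, read_strand) tuples with 0-based, half-open coordinates, that reads are antisense to the mRNA, and that the genome is a dictionary of sequences. The 10-nt window, the 0.7 A-fraction cutoff, and all function names are hypothetical.

```python
from collections import Counter

COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def revcomp(seq):
    """Reverse-complement a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def collapse_to_five_prime(alignments):
    """Count reads by the genomic position of their 5' end.

    alignments: iterable of (chrom, start, end, read_strand) tuples,
    0-based and half-open. Keys are (chrom, pos, transcript_strand).
    """
    counts = Counter()
    for chrom, start, end, read_strand in alignments:
        pos = start if read_strand == "+" else end - 1
        # Assumption: reads are antisense, so the transcript strand is flipped.
        tx_strand = "-" if read_strand == "+" else "+"
        counts[(chrom, pos, tx_strand)] += 1
    return counts


def downstream_a_fraction(genome, chrom, pos, tx_strand, window=10):
    """Fraction of A in the window just downstream of pos on the transcript strand."""
    seq = genome[chrom]
    if tx_strand == "+":
        flank = seq[pos + 1 : pos + 1 + window]
    else:
        flank = revcomp(seq[max(0, pos - window) : pos])
    if not flank:
        return 0.0
    return flank.upper().count("A") / len(flank)


def flag_a_rich_sites(counts, genome, window=10, min_a_fraction=0.7):
    """Map each site to True if its downstream genomic flank is A-rich."""
    return {
        site: downstream_a_fraction(genome, *site, window=window) >= min_a_fraction
        for site in counts
    }
```

A flagged site is only a candidate artifact: a genuine pA site can also lie next to an A-rich stretch, so the sketch reports flags rather than discarding reads. It also scores only A content, so it does not capture the G component of A/(G)-rich stretches mentioned in the caption.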
Availability – RECTIFY, Resolving Erroneous Calls of poly(A) Tails by IdentiFYing soft clipping, is available at: http://github.com/k-roy/RECTIFY