Long-read RNA sequencing is essential for producing accurate and exhaustive annotations of eukaryotic genomes. Despite advances in throughput and accuracy, reliable end-to-end identification of RNA transcripts remains a challenge for long-read sequencing methods. To address this limitation, a team led by researchers at the Centre for Genomic Regulation (CRG) developed CapTrap-seq, a cDNA library preparation method that combines the Cap-trapping strategy with oligo(dT) priming to detect 5’-capped, full-length transcripts, together with the data processing pipeline LyRic. The researchers benchmarked CapTrap-seq against other popular RNA-seq library preparation protocols in several human tissues, using both Oxford Nanopore (ONT) and PacBio sequencing. To assess the accuracy of the resulting transcript models, they introduced a capping strategy for synthetic RNA spike-in sequences that mimics natural 5’-cap formation. They found that the vast majority (up to 90%) of the transcript models that LyRic derives from CapTrap-seq reads are full-length, which makes it possible to produce highly accurate annotations with minimal human intervention.
(A) CapTrap-seq experimental workflow. Gray boxes highlight the four main steps of full-length (FL) cDNA library construction: Anchored dT Poly(A)+, CAP-trapping, CAP and Poly(A) dependent Linker Ligation, and FL-cDNA library enrichment, as described in the text. (B) The framework of the LyRic pipeline. The standard long-read data analysis process using LyRic includes five main steps. In the read alignment step (1), LyRic maps the long and, if available, short RNA-seq reads to the reference genome using Minimap2 and STAR, respectively. Next, in the alignment processing step (2), High-Confidence Genome Mappings (HCGMs) and HiSeq-supported reads are identified and then stranded based on their splice sites and poly(A) tail orientation (3). In the build TMs step (4), tmerge merges stranded alignments with compatible intron chains into non-redundant transcript models. Finally, LyRic evaluates transcript end completeness (5), using the clipped poly(A) tail to assess the 3’ end and external CAGE data (provided by the user) to assess the 5’ end. LyRic also offers optional steps to customize the data analysis workflow. When the user provides a set of non-overlapping capture-targeted regions for each sample in standard GTF format, LyRic groups features into target types, performs the analysis, and generates summary statistics for the targeted regions. Additional outputs include diagnostic plots and a UCSC Track Hub.
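To make steps (4) and (5) of the caption more concrete, here is a minimal Python sketch of the underlying idea. It is not LyRic or tmerge code. The `Read` type, the function names, and the 50-base CAGE window are all hypothetical, chosen only for illustration. As a deliberate simplification, the sketch collapses only reads whose intron chains are identical, whereas tmerge merges chains that are merely compatible.

```python
from collections import defaultdict
from typing import NamedTuple


class Read(NamedTuple):
    """Hypothetical container for one stranded, spliced alignment."""
    chrom: str
    strand: str       # "+" or "-", assumed already assigned (step 3)
    exons: tuple      # sorted ((start, end), ...) aligned blocks
    has_polya: bool   # True if a clipped poly(A) tail was observed


def intron_chain(exons):
    """Return the introns implied by consecutive exon blocks."""
    return tuple((exons[i][1], exons[i + 1][0]) for i in range(len(exons) - 1))


def merge_identical_chains(reads):
    """Collapse reads that share chromosome, strand and intron chain.

    Simplification: only identical chains are merged here; the real
    tmerge also merges compatible (e.g. partially overlapping) chains.
    """
    groups = defaultdict(list)
    for read in reads:
        chain = intron_chain(read.exons)
        if chain:  # unspliced reads are ignored in this sketch
            groups[(read.chrom, read.strand, chain)].append(read)

    models = []
    for (chrom, strand, chain), members in sorted(groups.items()):
        models.append({
            "chrom": chrom,
            "strand": strand,
            "introns": chain,
            # Extend the model to the outermost ends seen in any read.
            "start": min(m.exons[0][0] for m in members),
            "end": max(m.exons[-1][1] for m in members),
            "n_reads": len(members),
            "polya_support": any(m.has_polya for m in members),
        })
    return models


def is_full_length(model, cage_peaks, window=50):
    """Flag a model as full-length if both ends have support.

    cage_peaks: iterable of (chrom, strand, position) tuples.
    window: assumed tolerance in bases around the 5' end (illustrative).
    """
    # The 5' end is the leftmost coordinate on "+" and the rightmost on "-".
    five_prime = model["start"] if model["strand"] == "+" else model["end"]
    cage_support = any(
        chrom == model["chrom"]
        and strand == model["strand"]
        and abs(pos - five_prime) <= window
        for chrom, strand, pos in cage_peaks
    )
    return cage_support and model["polya_support"]
```

Grouping on the tuple of chromosome, strand and intron chain keeps the merge step a single pass over the reads. The completeness check then needs only the model's outermost coordinates and a flag recording whether any supporting read carried a poly(A) tail.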
CapTrap-seq has played a pivotal role in generating comprehensive transcriptome data for reference gene and transcript annotation in both the human and mouse genomes. In fact, thousands of CapTrap-seq transcript models have already been incorporated into the GENCODE annotations.