Current transcriptome annotations have largely relied on the short reads produced by the most widely used high-throughput cDNA sequencing technologies. For example, in the annotation of the Caenorhabditis elegans transcriptome, more than half of the transcript isoforms lack full-length support and instead rely on inference from short reads that do not span the full length of the isoform. To address this gap, researchers applied nanopore-based direct RNA sequencing to characterize the developmental polyadenylated transcriptome of C. elegans.
Taking advantage of long reads spanning the full length of mRNA transcripts, Johns Hopkins University researchers provided support for 23,865 splice isoforms across 14,611 genes, without the need for computational reconstruction of gene models. Of the isoforms identified, 3452 are novel splice isoforms not present in the WormBase WS265 annotation. Furthermore, they identified 16,342 isoforms in the 3′ untranslated region (3′ UTR), 2640 of which are novel and do not fall within 10 bp of existing 3′-UTR data sets and annotations. Combining 3′ UTRs and splice isoforms, they identified 28,858 full-length transcript isoforms. The researchers also determined that poly(A) tail lengths of transcripts vary across development, as do the strengths of previously reported correlations between poly(A) tail length and expression level, and between poly(A) tail length and 3′-UTR length. Finally, they formatted these data as a publicly accessible track hub, enabling others to explore the data set easily in a genome browser.
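The 10-bp novelty criterion for 3′-UTR isoforms can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names (`build_index`, `is_novel_end`), the data layout, the inclusive treatment of the 10-bp window, and the example coordinates are all assumptions made for illustration. The sketch treats a candidate 3′ end as novel only if no annotated end on the same chromosome and strand lies within the window.

```python
from bisect import bisect_left

TOLERANCE_BP = 10  # window stated in the text; inclusive bounds are an assumption


def build_index(annotated_ends):
    """Group annotated 3' end positions by (chromosome, strand) and sort them."""
    index = {}
    for chrom, strand, pos in annotated_ends:
        index.setdefault((chrom, strand), []).append(pos)
    for positions in index.values():
        positions.sort()
    return index


def is_novel_end(index, chrom, strand, pos, tolerance=TOLERANCE_BP):
    """True if no annotated end on the same chromosome and strand
    lies within `tolerance` bp of `pos`."""
    known = index.get((chrom, strand), [])
    # Index of the first annotated end at or beyond the left edge of the window.
    i = bisect_left(known, pos - tolerance)
    return not (i < len(known) and known[i] <= pos + tolerance)


# Made-up coordinates, for illustration only.
index = build_index([("I", "+", 5000), ("I", "+", 1000), ("II", "-", 750)])
print(is_novel_end(index, "I", "+", 1008))  # 8 bp from an annotated end
print(is_novel_end(index, "I", "+", 1025))  # 25 bp from the nearest annotated end
```

Sorting the annotated positions once and using a binary search keeps each lookup logarithmic in the number of annotated ends, which matters when tens of thousands of candidate ends are compared against several reference data sets.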
Overview of approach and sequencing of full-length isoforms
(A) Diagram of the C. elegans life cycle. (B) Plot of normalized coverage across the average coding gene with full-length (green), non-full-length (blue), and all reads (red) considered. (C) Percentage of reads that passed filtering and were called full-length in each stage. (D) Example locus showing reads aligning to the WBGene00022369 locus (black). (E) Comparison of length distributions of isoforms present in the WormBase WS265 annotation and splice isoforms identified by this study displayed as a density plot (top) and as the fold change of the densities (bottom). (F) As in E, comparison of length distribution of isoforms assembled by StringTie2 using Illumina-based RNA-seq from across C. elegans development and splice isoforms identified by this study. (G) Schematic defining “full-length isoform” as a combination of splice isoform and 3′-UTR isoform. (H) Number of splice, 3′-UTR, and full-length isoforms observed across all stages. (yAd) Young adult, (mAd) mature adult. Exact numbers can be found in Supplemental Table 3.