Transcription of eukaryotic genomes involves complex alternative processing of RNAs. Sequencing of full-length RNAs using long reads reveals the true complexity of processing. However, the relatively high error rates of long-read sequencing technologies can reduce the accuracy of intron identification. University of Dundee researchers apply alignment metrics and machine-learning-derived sequence information to filter spurious splice junctions from long-read alignments and use the remaining junctions to guide realignment in a two-pass approach. This method improves the accuracy of spliced alignment and transcriptome assembly for species both with and without existing high-quality annotations.
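The two-pass idea described above can be sketched in a few lines of Python. This is only an illustration of the control flow, not the 2passtools implementation: the `Junction` fields, the `keep_junction` thresholds, and the `align` and `extract_junctions` callables are all hypothetical placeholders that a real pipeline would replace with an actual aligner and actual junction metrics.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence


@dataclass(frozen=True)
class Junction:
    chrom: str
    start: int        # intron start
    end: int          # intron end
    strand: str
    metric: float     # hypothetical alignment metric; higher = better supported
    seq_score: float  # hypothetical sequence-model score in [0, 1]


def keep_junction(j: Junction, min_metric: float = 4.0,
                  min_seq_score: float = 0.5) -> bool:
    """Assumed filter rule: both scores must clear an arbitrary threshold."""
    return j.metric >= min_metric and j.seq_score >= min_seq_score


def two_pass(reads: Sequence,
             align: Callable[[Sequence, Optional[List[Junction]]], list],
             extract_junctions: Callable[[list], List[Junction]],
             keep: Callable[[Junction], bool] = keep_junction):
    """Align, filter the observed junctions, then realign with the survivors."""
    first_pass = align(reads, None)                 # pass 1: unguided
    candidates = extract_junctions(first_pass)      # all junctions seen in pass 1
    trusted = [j for j in candidates if keep(j)]    # drop likely-spurious ones
    second_pass = align(reads, trusted)             # pass 2: guided by trusted set
    return second_pass, trusted
```

Passing the aligner in as a callable keeps the sketch independent of any particular tool; the only assumption is that it accepts an optional list of junctions to guide alignment.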
Junction metrics can identify genuine splice junctions
(a) Outline of the two-pass method. (b) The junction alignment distance (JAD) metric can discriminate between annotated and unannotated splice junctions in simulated nanopore direct RNA sequencing (DRS) reads. The inverse cumulative density plot shows the distribution of per-splice-junction maximum JAD values for annotated (blue) and unannotated (orange) splice junctions. (c) Flowchart visualization of the first decision tree model. Nodes (decisions) and leaves (outcomes) are colored according to the relative ratio of real and spurious splice junctions. (d) Confusion matrix showing the ratios of correct and incorrect predictions of the first decision tree model on splice junctions extracted from simulated Arabidopsis read alignments.
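To make panels b and d more concrete, here is a rough Python sketch of how a per-junction maximum of a metric such as JAD could be collected and how prediction ratios could be tabulated. It does not compute JAD itself and is not taken from 2passtools: the `(junction_id, value)` input format, the threshold of 4, and the choice to normalize each row of the confusion matrix are assumptions made purely for illustration.

```python
from typing import Dict, Hashable, Iterable, Tuple


def max_metric_per_junction(
    observations: Iterable[Tuple[Hashable, float]]
) -> Dict[Hashable, float]:
    """Keep the largest metric value seen for each junction across all reads."""
    best: Dict[Hashable, float] = {}
    for junction_id, value in observations:
        if junction_id not in best or value > best[junction_id]:
            best[junction_id] = value
    return best


def predict_real(best: Dict[Hashable, float],
                 threshold: float = 4) -> Dict[Hashable, bool]:
    """Call a junction 'real' if its maximum metric reaches the (assumed) threshold."""
    return {jid: value >= threshold for jid, value in best.items()}


def confusion_ratios(truth: Dict[Hashable, bool],
                     predicted: Dict[Hashable, bool]) -> Dict[bool, Dict[bool, float]]:
    """Row-normalized 2x2 matrix: outer key = true class, inner key = predicted class."""
    counts = {True: {True: 0, False: 0}, False: {True: 0, False: 0}}
    for jid, is_real in truth.items():
        counts[is_real][predicted[jid]] += 1
    ratios = {}
    for actual, row in counts.items():
        total = sum(row.values())
        ratios[actual] = {p: (n / total if total else 0.0) for p, n in row.items()}
    return ratios
```

With simulated reads the true set of junctions is known, which is what makes a comparison like `confusion_ratios(truth, predict_real(best))` possible.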
Availability – The software package 2passtools is available at: https://github.com/bartongroup/2passtools