The rapid increase in High-throughput sequencing of RNA (RNA-seq) has led to tremendous improvements in the detection and reconstruction of both expressed coding and non-coding RNA transcripts. Yet, the complete and accurate annotation of the complex transcriptional output of not only the human genome has remained elusive. One of the critical bottlenecks in this endeavor is the computational reconstruction of transcript structures, due to high noise levels, technological limits, and other biases in the raw data.
University of Leipzig researchers introduce several new and improved algorithms in a novel workflow for transcript assembly and quantification. They propose an extension of the common splice graph framework that combines aspects of overlap and bin graphs and makes it possible to efficiently use both multi-splice and paired-end information to the fullest extent. Phasing information of reads is used to further resolve loci. The decomposition of read coverage patterns is modeled as a minimum-cost flow problem to account for the unavoidable non-uniformities of RNA-seq data.
Its performance compares favorably with state of the art methods on both simulated and real-life datasets. Ryūtō calls 1-4% more true transcripts, while calling 5-35% less false predictions compared to the next best competitor.
Overview of the most important steps of the transcript assembly pipeline
In (a) an example set of transcripts over the same gene is shown, together with potential paired-end reads leading to them. Reads have the same length r, but varying insert lengths. Be |X| the number of basepairs in exon X, then: |A|,|C|,|D|,|E|,|F|,|G|<r and |C|+|D|<r. All other conditions are >r. The commonly used split graph (b) removes all evidence from reads spanning more than two exons and paired-end information. Instead, an exon-bin based subset — adding edges between bins when a bin is a subset of another bin — and overlap — adding edges when a bin’s suffix is prefix of other bin — graph (c) is created (trivial subsets omitted). After removing transitive components, a directed acyclic splice graph is created that allows each bin to be uniquely mapped to a set of edges, but minimizes the number of ambiguous connections (d). Any maximum- or minimum-flow implementation can be used to establish transcript expressions on each edge. Due to flow conservation, composite paths and tree nodes can be reduced without loss of information. Flow decomposition into final transcripts can be improved by using evidences (e) from multi-exon-spanning single reads or paired-end data (dashed edge-links, color-code by (a))
Availability – Ryūtō is implemented in C++ and available at https://github.com/studla/RYUTO for download.