De novo assembly of short RNA-Seq reads into transcripts is challenging due to sequence similarities in transcriptomes arising from gene duplications and alternative splicing of transcripts. A team led by researchers at the University of Washington have developed Shannon, an RNA-Seq assembler with an optimality guarantee derived from principles of information theory: Shannon reconstructs nearly all information-theoretically reconstructable transcripts. Shannon is based on a theory the research team develop for de novo RNA-Seq assembly that reveals differing abundances among transcripts to be the key, rather than the barrier, to effective assembly. The assembly problem is formulated as a sparsest-flow problem on a transcript graph, and the heart of Shannon is a novel iterative flow-decomposition algorithm. This algorithm provably solves the information-theoretically reconstructable instances in linear-time even though the general sparsest-flow problem is NP-hard. Shannon also incorporates several additional new algorithmic advances: a new error-correction algorithm based on successive cancelation, a multi-bridging algorithm that carefully utilizes read information in the k-mer de Bruijn graph, and an approximate graph partitioning algorithm to split the transcriptome de Bruijn graph into smaller components. In tests on large RNA-Seq datasets, Shannon obtains significant increases in sensitivity along with improvements in specificity in comparison to state-of-the-art assemblers.
(a) The various steps of the proposed algorithm. (b) Architecture for scaling the proposed algorithm computationally. (c) Structure of error correction algorithm (d) A two-transcript example, where without abundance it is unclear which two paths to reconstruct. (e) A four-transcript example, where the utility of abundance needs well-designed algo- rithm to find the sparsest flow. In particular, a greedy algorithm will make a mistake and reconstruct a 5-transcript output. (f) A sample run of the algorithm for selecting transcripts from the transcript graph based on sparsest flow using local moves. (g) Example of a real splicing graph from RefSeq PPIP5K1 (Kinase) Gene having 4 isoforms: NM 001190214.1 at abundance a, NM 001130859.2 at abundance b, NM 001130858.2 at abundance c and NM 014659.5 at abundance d. Here nodes Xi correspond to exons, whereas edges and edge-weights correspond to junctions and expected copy counts.
Availability – Shannon is available as an open-source software and can be downloaded from http://sreeramkannan.github.io/Shannon/