Abundant but short second-generation sequencing reads make assembly difficult, leading to fragmented genomes and gene annotations. Gene structure information from RNA sequences can be used to improve the completeness and contiguity of an assembly, but bioinformatics methods have been lacking.
Researchers at Johns Hopkins School of Medicine have developed Rascaf, a highly efficient tool that leverages long-range continuity information from intron-spanning RNA sequencing (RNA-seq) read pairs to detect new contig connections. It determines a heaviest path in an exon block graph that simultaneously represents a gene and the underlying contig relationships. Rascaf is more accurate than its competitors, highly precise, and finds thousands of new verifiable connections in several draft Rosaceae genomes. Lightweight and practical, it can be readily incorporated into sequencing pipelines to improve an assembly and its gene annotations.
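The idea of a heaviest path can be illustrated with a small, generic sketch. It is not Rascaf's implementation: it assumes the graph is acyclic and that each edge carries a positive weight (for example, a read-support count), and the function name `heaviest_path` and the block labels are hypothetical.

```python
from collections import defaultdict


def heaviest_path(edges):
    """Return (total_weight, path) for the maximum-weight path in a DAG.

    edges: iterable of (source, target, weight) tuples.
    Assumption: the graph is acyclic; a ValueError is raised otherwise.
    """
    graph = defaultdict(list)
    indegree = defaultdict(int)
    nodes = set()
    for u, v, w in edges:
        graph[u].append((v, w))
        indegree[v] += 1
        nodes.update((u, v))
    if not nodes:
        return 0, []

    # Kahn's algorithm: repeatedly emit nodes with no unprocessed predecessors.
    order = []
    ready = [n for n in nodes if indegree[n] == 0]
    while ready:
        u = ready.pop()
        order.append(u)
        for v, _ in graph[u]:
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)
    if len(order) != len(nodes):
        raise ValueError("graph contains a cycle")

    # Dynamic programming over the topological order.
    best = {n: 0 for n in nodes}     # best path weight ending at each node
    prev = {n: None for n in nodes}  # predecessor on that best path
    for u in order:
        for v, w in graph[u]:
            if best[u] + w > best[v]:
                best[v] = best[u] + w
                prev[v] = u

    # Walk back from the node with the largest accumulated weight.
    node = max(nodes, key=lambda n: best[n])
    total = best[node]
    path = []
    while node is not None:
        path.append(node)
        node = prev[node]
    return total, path[::-1]


# Hypothetical exon blocks b1..b4 with made-up support weights.
example_edges = [("b1", "b2", 5), ("b2", "b3", 2), ("b1", "b3", 4), ("b3", "b4", 6)]
print(heaviest_path(example_edges))
```

In this toy input the route through b2 outweighs the direct b1 to b3 edge, so the selected path visits all four blocks.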
Overall framework of the Rascaf algorithm
Step 1: Prepare the raw assembly by splitting the scaffold-level assembly at runs of Ns. Paired-end RNA sequencing (RNA-seq) reads (red) connect four contigs (blue boxes) in the raw genome assembly. Step 2: Build the exon blocks by clustering read alignments along the genome. Step 3: Build the gene blocks by connecting exon blocks through introns extracted from spliced reads. Step 4: Build the gene block graph. Each gene block is represented by two nodes connected by a block edge (thick lines); ends of contig nodes linked by paired-end reads are then connected by mate edges (thin lines). Continuous lines represent the selected block scaffolds along the heaviest path in the gene block graph, whereas dotted lines mark unselected edges in the graph. Step 5: Given a block scaffold determined above, find a set of candidate connections between contigs underlying the gene blocks. Step 6: Build a contig graph by aggregating connections derived from multiple RNA-seq data sets. Each contig is represented by a pair of nodes connected by a contig edge (thick lines). Additionally, contigs that are adjacent in a scaffold in the raw assembly, or that were part of a contig connection detected in Step 5, are linked by a scaffold edge (thin lines). Step 7: Determine a set of cycle-free paths in the contig graph, using topological sorting, and use them to guide the construction of the new scaffolds.
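Step 1 of the framework, splitting scaffolds at runs of Ns, can be sketched in a few lines. This is an illustration only, not code from Rascaf: the function name `split_at_n_runs`, the `min_run` parameter, and the contig naming scheme are assumptions made for the example.

```python
import re


def split_at_n_runs(name, sequence, min_run=1):
    """Split one scaffold into contigs at runs of N (gap) characters.

    Returns a list of (contig_name, start_offset, contig_sequence) tuples,
    where start_offset is the 0-based position in the original scaffold.
    min_run is a hypothetical knob: the shortest N run treated as a gap.
    """
    gap = re.compile(r"[Nn]{%d,}" % min_run)
    contigs = []
    start = 0
    for match in gap.finditer(sequence):
        if match.start() > start:  # skip empty pieces (leading or adjacent gaps)
            piece = sequence[start:match.start()]
            contigs.append((f"{name}_{len(contigs) + 1}", start, piece))
        start = match.end()
    if start < len(sequence):      # trailing contig after the last gap
        contigs.append((f"{name}_{len(contigs) + 1}", start, sequence[start:]))
    return contigs


# Toy scaffold with two gaps; real input would come from a FASTA file.
for contig in split_at_n_runs("scaf1", "ACGTNNNNGGCCNTTAA"):
    print(contig)
```

Keeping the start offset of each contig makes it possible to map read alignments on the split contigs back to scaffold coordinates later.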
Availability: Rascaf is available free of charge under the GNU General Public License from https://github.com/mourisl/Rascaf.