With the increased availability of de novo assembly algorithms, it has become feasible to study the entire transcriptomes of non-model organisms. Although some algorithms are designed specifically for transcriptome assembly from high-throughput sequencing data, they are very memory-intensive, which limits their application to small data sets with few libraries.
Texas A&M University researchers have developed a transcriptome assembly algorithm that recovers alternatively spliced isoforms and expression levels while using as many RNA-Seq libraries as possible, containing hundreds of gigabases of data. New techniques allow the computations to be performed on a computing cluster with a moderate amount of physical memory.
Illustration of the iterative algorithm to enumerate k-mer frequencies
For the $k'$-mer $a_1 \cdots a_{k'}$, its two frequency slots with zero counts for nucleotides c and t are removed to obtain the $(k'+1)$-mers $a_1 \cdots a_{k'}a$ and $a_1 \cdots a_{k'}g$.
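The extension step described in the caption can be sketched in a few lines of Python. This is only an illustrative, in-memory sketch and not the released software: the function names `extend_kmers` and `enumerate_kmer_frequencies`, the starting length `k0`, and the use of plain dictionaries are all assumptions made for this example. It also ignores reverse complements and the cluster-level memory techniques mentioned above.

```python
from collections import Counter

NUCLEOTIDES = "acgt"


def extend_kmers(reads, kmers, k):
    """Extend each known k-mer by one nucleotide (hypothetical helper).

    Every k-mer gets four frequency slots, one per possible next
    nucleotide. Slots that stay at zero are removed, and each remaining
    slot becomes a (k+1)-mer with its observed count.
    """
    slots = {kmer: [0, 0, 0, 0] for kmer in kmers}
    for read in reads:
        for i in range(len(read) - k):
            prefix = read[i:i + k]
            nxt = read[i + k]
            if prefix in slots and nxt in NUCLEOTIDES:
                slots[prefix][NUCLEOTIDES.index(nxt)] += 1
    # Drop zero-count slots; keep only (k+1)-mers that actually occur.
    return {
        kmer + NUCLEOTIDES[j]: count
        for kmer, counts in slots.items()
        for j, count in enumerate(counts)
        if count > 0
    }


def enumerate_kmer_frequencies(reads, k, k0=1):
    """Iteratively build k-mer counts starting from k0-mers (assumed k0 <= k)."""
    initial = Counter(
        read[i:i + k0]
        for read in reads
        for i in range(len(read) - k0 + 1)
    )
    # Keep only k0-mers made of a, c, g, t (assumes lowercase reads).
    current = {
        kmer: count
        for kmer, count in initial.items()
        if all(ch in NUCLEOTIDES for ch in kmer)
    }
    for length in range(k0, k):
        current = extend_kmers(reads, current.keys(), length)
    return current


# Mirrors the caption: the 3-mer "ttc" is followed only by a and g,
# so the c and t slots are removed.
print(extend_kmers(["ttca", "ttcg"], {"ttc"}, 3))
# → {'ttca': 1, 'ttcg': 1}
```

In this sketch only k-mers that actually occur are ever stored, which is the point of removing the zero-count slots: the table grows with the observed sequence content rather than with all 4^k possible k-mers.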
This strategy minimizes memory consumption while achieving accuracy comparable to or better than that of existing algorithms. It also supports incremental updates of assemblies when new libraries become available.
Availability – A software program that implements the algorithm is available at: http://faculty.cse.tamu.edu/shsze/asplice.