De novo transcriptome assembly is an important technique for understanding gene expression in non-model organisms. Many de novo assemblers build a de Bruijn graph from the set of RNA sequences and rely on an in-memory representation of this graph. However, current methods analyse the complete set of read-derived k-mer sequences at once, which requires computer hardware with large shared memory.
A team led by researchers at STFC Daresbury Laboratory introduces a novel approach that clusters k-mers as the first step. The clusters correspond to small sets of gene products, which can be processed quickly to give candidate transcripts. They implement the clustering step using the MapReduce approach for parallelising the analysis of large datasets, which enables the use of compute clusters. The computational task is distributed across the compute system using the industry-standard MPI protocol, and no specialised hardware is required. Using this approach, the researchers have re-implemented the Inchworm module from the widely used Trinity pipeline and tested the method in the context of the full Trinity pipeline. Validation tests on a range of real datasets show large reductions in runtime and per-node memory requirements when a compute cluster is used.
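To make the MapReduce idea concrete, the following is a minimal single-process sketch of counting k-mer abundances in map, shuffle, and reduce steps. It is an illustration only, not the researchers' implementation: the function names (`map_read`, `shuffle`, `reduce_bucket`, `count_kmers`) and the toy partitioner are hypothetical, the "workers" are plain Python dictionaries rather than MPI ranks, and reverse complements are ignored for brevity.

```python
from collections import defaultdict


def map_read(read, k):
    """Map step: emit a (k-mer, 1) pair for every k-mer in one read."""
    for i in range(len(read) - k + 1):
        yield read[i:i + k], 1


def shuffle(pairs, n_workers):
    """Shuffle step: route every pair to a worker so equal k-mers meet."""
    buckets = [defaultdict(list) for _ in range(n_workers)]
    for kmer, count in pairs:
        # Hypothetical toy partitioner; a distributed run would instead
        # hash keys across the processes of the compute cluster.
        worker = sum(kmer.encode()) % n_workers
        buckets[worker][kmer].append(count)
    return buckets


def reduce_bucket(bucket):
    """Reduce step: sum the emitted counts for each k-mer on one worker."""
    return {kmer: sum(counts) for kmer, counts in bucket.items()}


def count_kmers(reads, k, n_workers=2):
    """Run map, shuffle, and reduce, then merge the per-worker results."""
    pairs = (pair for read in reads for pair in map_read(read, k))
    merged = {}
    for bucket in shuffle(pairs, n_workers):
        # Each k-mer lives on exactly one worker, so keys never collide.
        merged.update(reduce_bucket(bucket))
    return merged


reads = ["ACGTAC", "CGTACG"]
counts = count_kmers(reads, k=3)
```

Because each worker only ever holds the k-mers routed to it, the memory needed per worker shrinks as workers are added, which is the property the per-node memory reductions above depend on.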
Workflow summarising the MapReduce-Inchworm algorithm
The steps are described in the main text and in Additional file 1. In this figure, V represents k-mer nodes with abundances C, E represents edges with abundances CE, and Z represents zone IDs.
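The notation in the caption can be illustrated with a small sketch. It assumes, for illustration only, that a zone is a connected set of k-mers and that zone IDs are found by repeatedly passing the smallest label along edges; the summary does not specify how zones are actually assigned. The names `build_graph`, `assign_zones`, and `min_edge_count` are hypothetical, and reverse complements are ignored.

```python
from collections import Counter


def build_graph(reads, k):
    """Return C (abundance of each k-mer node V) and CE (abundance of each edge E)."""
    C = Counter()
    CE = Counter()
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        C.update(kmers)
        # An edge joins two k-mers that are adjacent in a read (overlap of k-1).
        CE.update(zip(kmers, kmers[1:]))
    return C, CE


def assign_zones(C, CE, min_edge_count=1):
    """Assign a zone ID Z to every k-mer by minimum-label propagation.

    Assumption: edges with abundance below min_edge_count are ignored,
    and every remaining connected component becomes one zone.
    """
    Z = {kmer: kmer for kmer in C}  # each k-mer starts in its own zone
    edges = [edge for edge, n in CE.items() if n >= min_edge_count]
    changed = True
    while changed:
        changed = False
        for u, v in edges:
            low = min(Z[u], Z[v])
            if Z[u] != low or Z[v] != low:
                Z[u] = Z[v] = low
                changed = True
    return Z


reads = ["ACGTA", "TTTGGG"]
C, CE = build_graph(reads, k=3)
Z = assign_zones(C, CE)
zones = set(Z.values())
```

In this toy input the two reads share no k-mers, so they end up in two separate zones, each of which could then be assembled independently.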
This study shows that MapReduce-based clustering has great potential for distributing challenging sequencing problems, without loss of accuracy. Although the researchers focussed on the Trinity package, they propose that such clustering is a useful initial step for other assembly pipelines.