As sequencing technologies progress, the amount of data produced grows exponentially, shifting the bottleneck of discovery towards the data analysis phase. In particular, currently available mapping solutions for RNA-seq leave room for improvement in terms of sensitivity and performance, hindering an efficient analysis of transcriptomes by massive sequencing.
Here, a team led by researchers at the University of Cambridge propose an innovative solution for high-quality mapping of both short and long reads. It combines mapping with the Burrows-Wheeler Transform and local alignment with Smith-Waterman (SW), and it drastically increases mapping accuracy (95% versus 60–85% for current mappers, in the most common scenarios) while substantially reducing runtimes (by about 15× compared with TopHat2, 8× with MapSplice and 2× with STAR). The proposed strategy has also proved to be quite robust against indels and mismatches: it provides a simple, fast and elegant solution that maps almost all the reads, even those containing a high number of mismatches or indels. The time saved in the mapping step contributes critically to accelerating current pipelines for sequencing data processing. The strategy is implemented in HPG Aligner, a program that makes use of different high-performance computing (HPC) technologies. It performs well with both short and long reads, and its runtimes depend only linearly on the number of reads.
Schema of the implementation of the mapping process.
(Top) Contiguous 16-bp seeds are taken covering the whole read. In addition, two overlapping seeds are taken near the ends of the read to anchor the ends to the exons. (Middle) Seeds are mapped without allowing any mismatch. Seeds that map closer to each other than the read size, and in the same strand orientation, constitute a candidate alignment location (CAL). (Bottom) CALs that lie closer than 500,000 bp to each other, and in the same strand orientation, are clustered to form candidate exons and transcripts. These candidates are then evaluated and assigned SW-based scores. This figure is available in black and white in print and in colour at DNA Research online.
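The seeding and clustering steps described in the caption can be illustrated with a minimal Python sketch. It is not the HPG Aligner implementation: a plain dictionary of reference 16-mers stands in for the Burrows-Wheeler Transform index, the SW scoring step is omitted, and all function names are hypothetical. The placement of the two extra seeds (half a seed in from each end) and the way distances are measured (from the end of one hit to the start of the next) are assumptions, since the caption does not specify them.

```python
import random
from collections import defaultdict

SEED_SIZE = 16         # seed length given in the caption
MAX_CAL_GAP = 500_000  # CAL clustering distance given in the caption


def reverse_complement(seq):
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def seed_offsets(read_len, k=SEED_SIZE):
    """Contiguous seed starts plus two overlapping seeds near the read ends.

    Assumption: the extra seeds sit half a seed in from each end.
    """
    offsets = set(range(0, read_len - k + 1, k))
    offsets.update({k // 2, read_len - k - k // 2})
    return sorted(o for o in offsets if 0 <= o <= read_len - k)


def build_index(reference, k=SEED_SIZE):
    """Dictionary of k-mer positions (a simple stand-in for a BWT index)."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def map_seeds(read, index, k=SEED_SIZE):
    """Exact-match seeds from both read orientations; no mismatches allowed."""
    hits = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for off in seed_offsets(len(seq), k):
            for pos in index.get(seq[off:off + k], ()):
                hits.append((strand, pos, pos + k))
    return hits


def build_cals(hits, read_len):
    """Merge same-strand seed hits lying closer than the read size into CALs."""
    cals = []
    for strand, start, end in sorted(hits):
        if cals and cals[-1][0] == strand and start - cals[-1][2] < read_len:
            prev = cals[-1]
            cals[-1] = (strand, prev[1], max(prev[2], end))
        else:
            cals.append((strand, start, end))
    return cals


def cluster_cals(cals, max_gap=MAX_CAL_GAP):
    """Group same-strand CALs closer than max_gap into candidate transcripts."""
    clusters = []
    for cal in sorted(cals):
        last = clusters[-1][-1] if clusters else None
        if last and last[0] == cal[0] and cal[1] - last[2] < max_gap:
            clusters[-1].append(cal)
        else:
            clusters.append([cal])
    return clusters


# Synthetic example: two 60-bp exons separated by a 1,000-bp intron.
rng = random.Random(0)


def rand_dna(n):
    return "".join(rng.choice("ACGT") for _ in range(n))


exon1, exon2 = rand_dna(60), rand_dna(60)
reference = rand_dna(200) + exon1 + rand_dna(1000) + exon2 + rand_dna(200)
read = exon1[-48:] + exon2[:48]  # 96-bp read spanning the splice junction

hits = map_seeds(read, build_index(reference))
cals = build_cals(hits, len(read))
clusters = cluster_cals(cals)
print(cals)
print(clusters)
```

For this synthetic junction-spanning read, the sketch produces two CALs, one per exon, grouped into a single cluster. In the scheme the caption describes, such a cluster would then be evaluated and scored with SW.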
Availability – HPG Aligner is free and open source. Documentation and software are available at: https://github.com/opencb/hpg-aligner/wiki.