from RoBlog by Rob Patro – Thoughts and musings on science and life
I’ve recently been working with my student Avi Srivastava on a new project, RapMap. RapMap is a tool for mapping (see below for the distinction between mapping and alignment) sequencing reads to a transcriptome. It currently implements two related but distinct mapping strategies, which I refer to as pseudo-mapping (this is no different from pseudo-alignment, an idea developed by Bray et al. as described below, but I will use the name pseudo-mapping for consistency) and quasi-mapping. A pre-print is forthcoming that will describe the tool, methods and results in further detail. However, in this post, I want to describe the main ideas and motivation behind this project, and provide a few of the results that we found so astounding.
Alignment vs. Mapping
First, let me attempt to explain what RapMap is not. This will require me to draw a distinction between alignment and mapping. Read alignment produces a nucleotide-to-nucleotide correspondence between each query sequence it deems sufficiently similar to some subsequence(s) of the reference. The result of computing such alignments is a collection of locations in the reference, as well as a specification (most commonly represented as a CIGAR string) of the correspondence between the query and the subsequence of the reference at this location that yields the smallest cost (or, equivalently, the highest score) under some specified model of sequence distance (or similarity). Such alignments are the de facto currency of many higher-level genomic analyses. It is important to note that the most common read aligners (Bowtie 2, BWA, STAR) are not exhaustive; that is, they do not guarantee that they will discover every subsequence of the reference that is within the specified distance of the query. Though such exhaustive or fully-sensitive tools do exist, they tend to be substantially slower, and are used primarily for tasks in which finding every admissible alignment is essential.
In comparison to alignment, mapping is the determination of the set of locations in the reference that match the query sequence (i.e. the read) “sufficiently well”. Here, I intentionally leave the term “sufficiently well” somewhat vague, as the problem is less well-studied than alignment and a reasonable determination of what constitutes “sufficiently well” is still a reasonable topic for discussion and debate. The key difference between mapping and alignment, as described here (again, I note that these terms are often used interchangeably in the broader community), is that mapping need not, and typically does not, compute or report a nucleotide-to-nucleotide correspondence between the query and locations on the reference. Avoiding this computation provides hope that reads can be mapped much faster than they can be aligned and, indeed, this seems to be the case.
RapMap is a read mapper. Given a reference (in this case, a transcriptome) and set of sequencing reads, it computes the likely loci of origin of each sequenced read with respect to the reference. The question, of course, is how to compute these locations rapidly and accurately. RapMap implements two distinct but related approaches, both of which achieve these goals in spades.
What’s in RapMap?
The first algorithm provided by RapMap is an independent re-implementation of the method of pseudo-alignment (herein called pseudo-mapping) that was first introduced by Bray et al. in the pre-print for the RNA-seq quantification tool Kallisto. Re-implementing a method is a fantastic way to understand how it works, and the initial motivation behind writing the first part of the RapMap coe was to really grok the innards of pseudo-mapping by implementing it. A fringe benefit of this is to provide an implementation of this idea under a different license (currently RapMap is licensed under GPL) than that adopted by Kallisto, whose license was the cause of some debate (a blog post by Lior Pachter, two by Titus Brown (1), (2), and two over at homolog.us (1), (2) ) a few weeks back. While the details of our re-implementation differ in some places from the implementation provided in Kallisto, we neither expect nor observe that these differences have a substantial effect on the speed or quality of the pseudo-mapping process. Thus, I refer the curious reader to the Kallisto pre-print (and the even-more-curious reader to the Kallisto and RapMap source code) for a more thorough description of pseudo-mapping.
The second algorithm provided by RapMap — quasi-mapping — is a novel one1, and thus the one upon which I wish to focus in this blog post. It is inspired by some ideas from the venerable STAR aligner, and from understanding the essence of what makes pseudo-mapping so fast in the context of transcriptomes. I will provide a high-level overview of how quasi-mapping works but, again, the forthcoming pre-print and the RapMap software itself will be the best place to understand the requisite details.