Vamsi Veeramachaneni, Strand Life Sciences Pvt. Ltd.
There are two common methods for aligning small RNA reads (after we take care of such pesky details as trimming the adapters). We can
- create a reference collection of known small RNA sequences and align the reads against this collection, or
- we could align the reads against the entire genome and use genome-based annotations to later compute small RNA gene expression
Both methods have their own pros and cons. The first method is likely to be very fast, but the output is dependent on knowing exactly what we want to measure. The second method is more resource intensive but it allows you to check the correctness of the annotations for known genes, helps to find novel genes, and explores small RNA editing. For this post, we will focus on the second method.
Alignment against the genome can be tricky because of the number of parameters that need to be specified to the alignment tool. Two common parameters are:
- Minimum quality of alignment: specified as the number (or %) of mismatches/gaps that can be tolerated,
- Number of matches allowed: for reads which match in more locations than this threshold, we can either choose to report a random subset of the matches or ignore the read entirely.
Because of their longer reads and more accurate sequencing, the parameters discussed are not as critical for DNA-Seq or RNA-Seq experiments; we can demand 95% identity and unique matches from the tool, and still hope to get alignments suitable for downstream analysis.
However, when aligning small RNA reads which are only 18-35bp long against the genome, choosing the appropriate multiple-match threshold is crucial to accurately measure the expression of various small RNA species.
While analyzing a small RNA dataset that was provided by a collaborator, we experimented with different multiple-match thresholds using Avadis® NGS. We set the alignment parameters such that any read matching the hg19 genome up to a 1000 times would be reported with all of its matches. This resulted in more than 90% of the reads being aligned.
We were of course curious about the nature of the reads aligning to the genome with such high multiplicity. So we divided the reads into groups based on how “multiply-mapped” they were. The bins we chose were – 0 matches (unaligned), 1 match (unique), 2, 3-4, 5-8, 9-16, 17-32, 33-64, 65-128, 129-256, 257-512, and 513-1000 times matched. We counted the number of reads that fell in each bin and the number of matches that they contributed in the alignment. The figure that follows shows a cumulative distribution based on these numbers.
The blue line showed the cumulative distribution of reads that fell in the different multiplicity bins. It showed that 60% of the reads that matched did so uniquely. An additional 10% of the reads matched at two locations. When we allowed up to 32 matches, 98% of all the reads (that would align) had all their matches captured in the alignment output. However, to get alignments for the remaining 2% of reads, we needed to double the size of the alignment output (indicated by the red line).
This brought us to our second question; what were the regions in the genome to which these multiply matching reads were falling? So we took the matches falling in each of the bins mentioned above and used the “Genic region QC plot” in Avadis® NGS to see the locations (with respect to the RefSeq transcript model). The results were quite informative!
The first two pie-charts indicated that the reads that mapped uniquely (or at most twice) tended to be miRNA reads (shown by the gray-colored pie). That contribution from miRNA vanished in the second set of pie-diagrams (for matches> 5). Hence, if one is interested in miRNA analysis then one can ignore reads mapping to more than 10 locations. If instead one were performing tRNA quantification, one would need to allow for reads that map up to ~50 times.
We hope that with this example, the choice of parameters for carrying out small RNA alignment becomes easier. Choosing to store up to 1000 matches for each read will result in unnecessarily large files (5x-10x larger) if the end goal for instance is to simply analyze miRNAs.