by Matthew MacManes
I was pointed to a new paper in PLOS ONE: An Extensive Evaluation of Read Trimming Effects on Illumina NGS Data Analysis. Their central thesis seems to be this:
“…trimming is beneficial in RNA-Seq, SNP identification and genome assembly procedures, with the best effects evident for intermediate quality thresholds (Q between 20 and 30).”
This is a topic I have thought about a lot, as I’ve recently written up a manuscript on the same subject: On the optimal trimming of high-throughput mRNAseq data (Blog post). I show that anything more than VERY gentle trimming is harmful to de novo assembly and transcriptome characterization. My findings seem to be in conflict with those presented by Giorgi and colleagues. I’ll tell you up front that I think I’m right, at least for the RNAseq part of their paper.
With regard to RNAseq, they show that the percentage of reads mapped to the reference increases with moderate trimming (red bars), then decreases with more aggressive (>Q30) trimming. Note that I don’t think better mapping is necessarily equivalent to better RNAseq results, but I’ll save that issue for later.
Anyway, I don’t think we really care about the percentage of reads mapped correctly; we care about the total number of reads correctly mapped. Surely, 99% mapping of a 1M read dataset is much worse than 80% mapping of a 100M read dataset. This is basically what they show: trimming reduces the size of the dataset (blue bars) but increases the mapping rate (red bars). No big deal there.
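To make the arithmetic behind that comparison concrete, here is a minimal Python sketch. The function name `reads_mapped` is purely illustrative, and the inputs are the hypothetical 1M and 100M read datasets from the example above, not values taken from the paper.

```python
def reads_mapped(total_reads, mapping_rate):
    """Absolute number of mapped reads = dataset size x fraction mapped."""
    return round(total_reads * mapping_rate)

# Hypothetical datasets from the comparison above (not data from the paper).
small_high_rate = reads_mapped(1_000_000, 0.99)    # 99% of 1M reads
large_low_rate = reads_mapped(100_000_000, 0.80)   # 80% of 100M reads

print(small_high_rate)  # 990000
print(large_low_rate)   # 80000000
```

The dataset with the lower mapping rate still yields roughly 80 times more mapped reads, which is why the absolute count, rather than the percentage, is the number to watch.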
Again, what we really care about is the absolute number of reads mapped correctly, and when you look at that, trimming, particularly at their ‘best’ trimming thresholds, looks anything but beneficial for RNAseq. Here is their data, plotted using the information contained in their supplementary Table S1. See what happens to the number of reads mapped as the trimming threshold increases?
This shows that trimming past Q5 (Q10 for fastX) results in a reduction in the absolute number of reads mapping, and the reduction is really profound at the trimming levels they report as best! At Q30, only 10% of the reads map, as compared to 72% of the reads in the untrimmed dataset. I’m not going to spend the time to determine whether this reduction is meaningful to the downstream RNAseq analyses (though the authors of the paper should have), but I’m going to suggest that this amount of reduction would be detrimental to any RNAseq experiment, not beneficial, as the authors claim.
So, is trimming beneficial to RNAseq? The answer is no, at least beyond very gentle trimming.