When estimating the expression of a transcript, or part of a transcript, from RNA-Seq data, it is commonly assumed that reads are generated uniformly from positions within the transcript. While this assumption is acceptable for long transcript sequences, it frequently leads to large errors for short sequences, e.g., those shorter than 100 bp. Analysis of short sequences, such as microRNAs and splice junctions adjacent to alternatively spliced exons, is increasingly important and makes it necessary to address errors in short-sequence expression estimation.
Indeed, when researchers from the University of Toronto examined RNA-Seq data from diverse studies, they found that large errors are introduced by variations in RNA-Seq coverage due to sequence content, experimental conditions and sample preparation. The researchers developed a technique that they call the positional bootstrap, which quantifies the level of uncertainty in expression induced by non-uniform coverage. Methods that attempt to correct for biases in coverage do so by making strong assumptions about the form of those biases; the positional bootstrap, by contrast, can quantify the noise induced by all types of bias, including unknown ones. Results obtained using independently generated RNA-Seq datasets show that the positional bootstrap increases the accuracy of estimates of alternative splicing levels, tissue-differential alternative splicing and tissue-differential expression by a factor of up to 10.
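To make the idea concrete, the following is a minimal sketch of a position-level bootstrap. It is not the BENTO-Seq code or its API. It assumes, as the name suggests, that the resampling unit is a read start position within the short sequence rather than an individual read. The function name `positional_bootstrap`, its parameters, and the read counts are all hypothetical and chosen only for illustration.

```python
import random
import statistics


def positional_bootstrap(position_counts, n_boot=1000, seed=0):
    """Resample positions (not reads) with replacement.

    position_counts: read count at each start position of one short
    sequence, e.g. one exon-exon junction (hypothetical input format).
    Returns one mean-reads-per-position estimate per bootstrap replicate.
    """
    rng = random.Random(seed)
    n_pos = len(position_counts)
    estimates = []
    for _ in range(n_boot):
        # Draw n_pos positions with replacement, keeping each position's count.
        resampled = rng.choices(position_counts, k=n_pos)
        estimates.append(sum(resampled) / n_pos)
    return estimates


# Made-up counts: both junctions have 50 reads over 10 positions.
uniform = [5, 5, 5, 5, 5, 5, 5, 5, 5, 5]
spiky = [0, 0, 0, 0, 48, 0, 1, 0, 1, 0]

for name, counts in (("uniform", uniform), ("spiky", spiky)):
    est = positional_bootstrap(counts)
    # The spread of the replicates reflects how uneven the coverage is.
    print(name, statistics.mean(est), statistics.stdev(est))
```

Under these assumptions, the two junctions have the same total read count, but the evenly covered one yields identical replicates while the one dominated by a single position yields a wide spread. That spread is the kind of coverage-induced uncertainty the text describes, and it requires no model of where the bias comes from.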
Sequence-, experiment-, and dataset-dependent biases in real RNA-Seq data
A-B: Distribution of the number of positions with at least one mapped read spanning an exon-exon junction in the Bodymap data (red), randomly resampled Bodymap data with effects of experiment-dependent bias canceled (green), and simulated data with no sequence- and experiment-dependent bias (blue). n is the total number of reads mapped to the junction. C: For each pair of tissues, the observed frequency of positions that have more reads than the median in both tissues. All lines are above 0.5, showing that reads tend to be mapped to the same positions in two tissues. D: Distribution of junctions by the proportion of reads coming from the Bodymap dataset versus Kaessmann’s dataset. Solid line: RNA-Seq data. Dotted line: expected distribution if there were no dataset-dependent differences across junctions. The real data have a much wider distribution, reflecting dataset-dependent differences in sequencing technology and sample variability.
Availability – An efficient Python implementation of the algorithm is freely available from github.com/PSI-Lab/BENTO-Seq.