Mixture models reveal multiple positional bias types in RNA-Seq data

Positional fragment bias reduces the accuracy of transcript quantification with RNA-Seq. Researchers at Lexogen GmbH introduce Mix2 (read “mixquare”), a transcript quantification method that uses a mixture of probability distributions to model, and thereby neutralize, the effects of positional fragment bias. The parameters of Mix2 are trained by Expectation Maximization, which yields transcript abundance and bias estimates simultaneously.
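To make the idea of a mixture trained by Expectation Maximization more concrete, here is a minimal toy sketch in Python. It is not the Mix2 implementation: it assumes three fixed, hypothetical bias shapes over five position bins along a single transcript and estimates only their mixing weights, whereas the method described above also estimates transcript abundances. All names (`fit_mixture_weights`, `five_prime`, `uniform`, `three_prime`) and all numbers are illustrative assumptions.

```python
# Toy EM: estimate the mixing weights of fixed positional bias shapes.
# Assumption: fragment start positions are binned into 5 relative-position
# bins, and each component is a known probability distribution over the bins.

def fit_mixture_weights(counts, components, n_iter=1000):
    """Return mixing weights that maximize the likelihood of the bin counts."""
    k = len(components)
    total = float(sum(counts))
    weights = [1.0 / k] * k  # start from equal weights
    for _ in range(n_iter):
        expected = [0.0] * k
        for b, c in enumerate(counts):
            if c == 0:
                continue
            # E-step: responsibility of each component for fragments in bin b
            joint = [w * comp[b] for w, comp in zip(weights, components)]
            norm = sum(joint)
            for j in range(k):
                expected[j] += c * joint[j] / norm
        # M-step: new weight = share of fragments assigned to the component
        weights = [e / total for e in expected]
    return weights


# Hypothetical bias shapes (each sums to 1 over the 5 bins).
five_prime = [0.60, 0.25, 0.10, 0.04, 0.01]   # fragments pile up at the 5' end
uniform = [0.20, 0.20, 0.20, 0.20, 0.20]      # the common uniform assumption
three_prime = list(reversed(five_prime))      # fragments pile up at the 3' end
components = [five_prime, uniform, three_prime]

# Noise-free expected counts generated from known weights, for illustration.
true_weights = [0.5, 0.3, 0.2]
counts = [
    10000 * sum(w * comp[b] for w, comp in zip(true_weights, components))
    for b in range(5)
]

estimated = fit_mixture_weights(counts, components)
print([round(w, 3) for w in estimated])
```

Because the counts here are generated without noise from the same three shapes, the estimated weights should approach the weights used to generate them. With real reads, the counts would be noisy and the component shapes themselves would have to be learned.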

The researchers compare Mix2 to Cufflinks, RSEM, eXpress and PennSeq, state-of-the-art quantification methods that implement some form of bias correction. On four synthetic biases they show that the accuracy of Mix2 overall exceeds that of the other methods and that its bias estimates converge to the correct solution. They further evaluate Mix2 on real RNA-Seq data from the Microarray and Sequencing Quality Control (MAQC, SEQC) Consortia. On MAQC data, Mix2 achieves improved correlation to qPCR measurements, with a relative increase in R² between 4% and 50%. In addition, Mix2 reveals five dominant biases in MAQC data that deviate from the common assumption of a uniform fragment distribution. The researchers also observe improved repeatability across laboratory sites, with a relative increase in R² between 8% and 44% and a reduced standard deviation.

Types of biases detected in lane SRR037445 of UHR in the MAQC data set and their transcript length distributions

(a) to (f): the six most prominent biases, which account for 73.43% of transcripts. The bias is shown on the left, the transcript length distribution on the right. (g) Bias and transcript length distribution of the complete, unclustered set of transcripts.

Availability – Mix2 has been implemented as an Octave script with readable code and as a closed-source C++ implementation. Both versions can be downloaded from https://www.lexogen.com/mix-square-scientific-license.

Tuerk A, Wiktorin G, Güler S (2017) Mixture models reveal multiple positional bias types in RNA-Seq data and lead to accurate transcript concentration estimates. PLoS Comput Biol 13(5): e1005515.

One comment

  1. Kristoffer Vitting-Seerup

    Remember that Kallisto, Salmon and Alpine already model these (and more) biases.
