
Graphical model of the RNA-seq mixture problem. Given a known transcriptome T and some observed reads R, the inference problem is to estimate the relative transcript expression (Theta) through the latent variables Z.
Assigning RNA-seq reads to their transcript of origin is a fundamental task in transcript expression estimation. Where ambiguities in assignments exist due to transcripts sharing sequence, e.g. alternative isoforms or alleles, the problem can be solved through probabilistic inference. Bayesian methods have been shown to provide accurate transcript abundance estimates compared to competing methods. However, exact Bayesian inference is intractable and approximate methods such as Markov chain Monte Carlo (MCMC) and Variational Bayes (VB) are typically used. While providing a high degree of accuracy and modelling flexibility, standard implementations can be prohibitively slow for large datasets and complex transcriptome annotations.
A team led by researchers at the Sheffield Institute for Translational Neuroscience has developed a novel approximate inference scheme based on VB and applied it to an existing model of transcript expression inference from RNA-seq data. Recent advances in VB algorithmics are used to improve the convergence of the algorithm beyond the standard Variational Bayes Expectation Maximisation (VBEM) algorithm. They applied their algorithm to simulated and biological datasets, demonstrating a significant increase in speed with only a very small loss in accuracy of expression level estimation. The researchers also carried out a comparative study against seven popular alternative methods and demonstrated that the new algorithm provides excellent accuracy and inter-replicate consistency while remaining competitive in computation time.
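To make the read-assignment idea concrete, here is a minimal sketch of a standard VBEM loop for a toy mixture model. It is not the BitSeq implementation and not the faster scheme described above. The function names, the symmetric Dirichlet prior `alpha0`, and the toy likelihood matrix are all assumptions made for illustration. Each row of `lik` holds the hypothetical probability of one read given each transcript.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:
        result -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return result + math.log(x) - 0.5 / x - f * (1.0 / 12 - f * (1.0 / 120 - f / 252))


def vbem_read_assignment(lik, alpha0=1.0, n_iter=200, tol=1e-10):
    """Toy VBEM for a read-to-transcript mixture (illustrative assumption).

    lik[n][m] is the likelihood of read n given transcript m; every read
    is assumed to be compatible with at least one transcript.
    Returns the posterior mean expression and the read responsibilities.
    """
    n_transcripts = len(lik[0])
    alpha = [alpha0] * n_transcripts  # Dirichlet parameters of q(theta)
    resp = []
    for _ in range(n_iter):
        # E-like step: expected log mixing weights under q(theta)
        psi_total = digamma(sum(alpha))
        weights = [math.exp(digamma(a) - psi_total) for a in alpha]
        resp = []
        for row in lik:
            unnorm = [w * p for w, p in zip(weights, row)]
            total = sum(unnorm)
            resp.append([u / total for u in unnorm])
        # M-like step: update Dirichlet parameters with expected counts
        new_alpha = [alpha0 + sum(r[m] for r in resp) for m in range(n_transcripts)]
        converged = max(abs(a - b) for a, b in zip(alpha, new_alpha)) < tol
        alpha = new_alpha
        if converged:
            break
    total_alpha = sum(alpha)
    theta = [a / total_alpha for a in alpha]
    return theta, resp


# Two reads map uniquely to transcript 0; one read is ambiguous.
toy_lik = [[1.0, 0.0], [1.0, 0.0], [0.5, 0.5]]
theta, resp = vbem_read_assignment(toy_lik)
print(theta, resp[2])
```

In this toy case the ambiguous read is assigned mostly to the transcript that already has unique support, which is the behaviour a mixture model is meant to capture. A real implementation would derive the likelihoods from alignments and handle far larger inputs.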
Ranking of methods for five replicates of simulated RNA-seq reads. WGE-Inter: inter-replicate consistency of within-gene estimates; WGE-True: within-gene estimates compared to the true values; Theta: estimated relative transcript expression compared to the true values. Scores have been normalised to unity per dataset.
Availability – The methods were implemented in R and C++, and are available as part of the BitSeq project at github.com/BitSeq. The method is also available through the BitSeq Bioconductor package. The source code to reproduce all simulation results can be accessed via github.com/BitSeq/BitSeqVB_benchmarking
Contact – Magnus.Rattray@manchester.ac.uk