A new computational method makes gene expression analyses from RNA-Seq data more accurate

Technique detects technical biases that otherwise confound test results

A new computational method can improve the accuracy of gene expression analyses, which are increasingly used to diagnose and monitor cancers and are a major tool for basic biological research.

Researchers from Carnegie Mellon University, Stony Brook University and Dana-Farber Cancer Institute said their method, called Salmon, can correct for the technical biases known to occur during RNA sequencing (RNA-seq), the leading method for estimating gene expression. Furthermore, it runs about as fast as other fast methods, a critical factor as these tests grow more common and numerous.

Their report is being published online Monday, March 6, by the journal Nature Methods. Carl Kingsford, associate professor in CMU’s Computational Biology Department, said the Salmon source code is freely available online and already has been downloaded by thousands of users.

“Salmon provides a much richer model of the RNA-seq experiment and of the possible biases that are known to occur during sequencing,” Kingsford said. This is important, he added, because the technique is increasingly used for classifying diseases and their subtypes, understanding gene expression changes during development, and tracking the progression of cancer.

Though an organism’s genetic makeup is static, the activity of individual genes varies greatly over time, making gene expression an important factor in understanding how organisms work and what occurs during disease processes. Gene activity can’t be efficiently measured directly, but it can be inferred by monitoring RNA, the molecules that carry information from genes to direct protein production and other cellular activities.

RNA-seq is a leading technology for producing these snapshots of gene activity. But depending on the tissue being analyzed and the way each sample is prepared, various experimental biases can occur and cause RNA-seq “reads” to be over- or undersampled from various genes, Kingsford said.

“Though we know many of the kinds of biases that can occur, modeling them has to occur on a sample-by-sample basis,” he said. “If you have to build a complicated bias model using traditional methods, it takes a really long time.”

The researchers named the method after a fish famous for swimming upstream because it employs an algorithm that can estimate the effect of biases and the expression level of genes as experimental data streams by.

“In that way, it can build up a rich bias model and do so approximately as fast as other fast analysis tools,” Kingsford said.

Overview of Salmon’s method, components and execution timeline

Salmon accepts either raw (green arrows) or aligned (gray arrow) reads as input. When processing quasi-mappings or aligned reads, Salmon executes an online inference algorithm. This ensures that transcript abundance estimates are available to estimate weights for the rich equivalence classes, and to consider the appropriate conditional probabilities when learning the experimental parameters and foreground bias models. After a fragment’s contributions to the online abundance estimates and bias models have been computed, the fragment is placed into an appropriate equivalence class (or one is created if it does not yet exist). Once all of the fragments have been observed, the initial abundances and fragment equivalence classes are passed to the offline inference module. The offline module learns the background bias models (based on initial abundance estimates) and then corrects the effective transcript lengths to account for the appropriate biases. Finally, the offline inference algorithm (EM or VBEM) is run over the reduced representation of the data until convergence. Once estimation is complete, posterior samples are generated via Gibbs sampling or a bootstrap procedure if the user has requested this.
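The offline step described above can be illustrated with a toy example. The Python sketch below is not Salmon's implementation: it assumes plain equivalence classes (no per-class weights), fixed effective lengths with no bias correction, and a basic EM loop rather than VBEM. The function names `build_equivalence_classes` and `em_abundances`, and the toy transcripts "A" and "B", are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_equivalence_classes(fragment_mappings):
    """Group fragments by the set of transcripts each one is compatible with.

    Returns a Counter mapping frozenset(transcripts) -> number of fragments,
    which is the reduced representation the EM loop iterates over.
    """
    return Counter(frozenset(m) for m in fragment_mappings)


def em_abundances(eq_classes, eff_lengths, n_iter=1000, tol=1e-10):
    """Estimate expected fragment counts per transcript with a basic EM.

    eq_classes  : Counter from build_equivalence_classes
    eff_lengths : dict transcript -> effective length (assumed already fixed)
    """
    transcripts = sorted(eff_lengths)
    total = sum(eq_classes.values())
    # Start from a uniform assignment of fragments to transcripts.
    counts = {t: total / len(transcripts) for t in transcripts}

    for _ in range(n_iter):
        new_counts = {t: 0.0 for t in transcripts}
        for label, n in eq_classes.items():
            # E-step: share the class's fragments in proportion to
            # current abundance divided by effective length.
            weights = {t: counts[t] / eff_lengths[t] for t in label}
            denom = sum(weights.values())
            if denom == 0:
                continue
            # M-step contribution: accumulate expected counts.
            for t, w in weights.items():
                new_counts[t] += n * w / denom
        delta = max(abs(new_counts[t] - counts[t]) for t in transcripts)
        counts = new_counts
        if delta < tol:
            break
    return counts


# Toy data: 60 fragments map only to A, 20 only to B, 20 to both.
fragments = [{"A"}] * 60 + [{"B"}] * 20 + [{"A", "B"}] * 20
classes = build_equivalence_classes(fragments)
estimates = em_abundances(classes, {"A": 1000.0, "B": 1000.0})
print(estimates)  # roughly 75 fragments for A and 25 for B
```

The point of the reduced representation is that the loop visits three equivalence classes rather than 100 individual fragments, so the cost of each EM iteration depends on the number of distinct classes, not on the number of reads.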

Source – Carnegie Mellon University

Availability – The source code for Salmon is freely available and licensed under the GNU General Public License (GPLv3). The latest version of Salmon can be obtained from https://github.com/COMBINE-lab/salmon.

Patro R, Duggal G, Love MI, Irizarry RA, Kingsford C. (2017) Salmon provides fast and bias-aware quantification of transcript expression. Nature Methods [Epub ahead of print].
