Rob Patro, an assistant professor in the Department of Computer Science in Stony Brook’s College of Engineering and Applied Sciences, leads a group of computational biology researchers that developed Salmon, a new lightweight software tool that provides fast and bias-aware quantification of transcript expression from RNA-sequencing reads. The research was published in the March 6 edition of Nature Methods.
The team includes researchers from the Department of Computer Science at Stony Brook University, the University of North Carolina–Chapel Hill, the Harvard School of Public Health, the Carnegie Mellon School of Computer Science, and private industry.
“This research represents a perfect storm for computer science,” said Computer Science Chair Arie Kaufman. “We have a group of knowledge-driven collaborators from across the United States, funded by multiple sources, and striving for advancing genomic research by developing an innovative tool. I congratulate them on this discovery.”
In genomics, transcript abundance estimates are used to classify diseases and their subtypes, to understand how gene expression changes correlate with phenotype, and to track the progression of cancer. Accurate abundance estimates from RNA-seq data are especially important for two reasons: a wide range of biases affect the RNA-seq fragmentation and sequencing processes, and expression data are used to study disease and, eventually, will be used for medical diagnosis and personalized treatment.
Created by researchers Rob Patro, Geet Duggal, Michael Love, Rafael A. Irizarry and Carl Kingsford, Salmon combines in a single tool many algorithmic and methodological advances that will power both small- and large-scale gene expression studies.
Figure: Overview of Salmon’s method and components, and its execution timeline
Salmon accepts either raw (green arrows) or aligned (gray arrow) reads as input. When processing quasi-mappings or aligned reads, Salmon executes an online inference algorithm. This ensures that transcript abundance estimates are available to estimate weights for the rich equivalence classes, and to consider the appropriate conditional probabilities when learning the experimental parameters and foreground bias models. After a fragment’s contributions to the online abundance estimates and bias models have been computed, the fragment is placed into an appropriate equivalence class (or one is created if it does not yet exist). Once all of the fragments have been observed, the initial abundances and fragment equivalence classes are passed to the offline inference module. The offline module learns the background bias models (based on initial abundance estimates) and then corrects the effective transcript lengths to account for the appropriate biases. Finally, the offline inference algorithm (EM or VBEM) is run over the reduced representation of the data until convergence. Once estimation is complete, posterior samples are generated via Gibbs sampling or a bootstrap procedure if the user has requested this.
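The offline step described above, running EM over fragment equivalence classes rather than over individual fragments, can be illustrated with a short Python sketch. This is not Salmon's C++ implementation: the function names are hypothetical, the classes here carry only fragment counts (not the per-class weights of Salmon's rich equivalence classes), the effective lengths are taken as given rather than bias-corrected, and the online phase, VBEM variant and posterior sampling are left out.

```python
from collections import Counter


def build_equivalence_classes(fragment_mappings):
    """Group fragments by the set of transcripts they map to.

    fragment_mappings: iterable of iterables of transcript names,
    one entry per fragment. Returns {frozenset(transcripts): count}.
    (Hypothetical helper, for illustration only.)
    """
    classes = Counter()
    for transcripts in fragment_mappings:
        classes[frozenset(transcripts)] += 1
    return classes


def em_abundances(classes, eff_lengths, n_iter=1000, tol=1e-8):
    """Plain EM over equivalence classes.

    classes: {frozenset(transcripts): fragment count}
    eff_lengths: {transcript: effective length}, assumed precomputed
    Returns the expected number of fragments assigned to each transcript.
    """
    transcripts = sorted(eff_lengths)
    total = sum(classes.values())
    # Start from a uniform allocation of fragments to transcripts.
    counts = {t: total / len(transcripts) for t in transcripts}

    for _ in range(n_iter):
        new_counts = dict.fromkeys(transcripts, 0.0)
        for label, n in classes.items():
            # E-step: split the class's n fragments among its transcripts
            # in proportion to current count / effective length.
            weights = {t: counts[t] / eff_lengths[t] for t in label}
            denom = sum(weights.values())
            if denom == 0.0:
                continue  # no transcript in this class has support
            for t, w in weights.items():
                # M-step accumulation: expected fragments per transcript.
                new_counts[t] += n * w / denom
        delta = max(abs(new_counts[t] - counts[t]) for t in transcripts)
        counts = new_counts
        if delta < tol:
            break
    return counts
```

Because each iteration loops over the distinct equivalence classes instead of over every fragment, the per-iteration cost depends on the number of classes, which is the sense in which the data representation is "reduced". As a toy check, 30 fragments mapping only to transcript A, 10 only to B and 20 to both, with equal effective lengths, converge to about 45 fragments for A and 15 for B.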
According to Patro, the hallmarks of the method are its speed, accuracy and robustness. Salmon runs at a similar speed to existing fast algorithms for quantifying gene expression, yet it incorporates a rich and expressive model of the underlying experiment, including many technical biases, and uses a new statistical inference procedure to estimate gene expression quickly and accurately.
“The methodological underpinnings of Salmon provide a framework upon which we can continue to build accurate models and efficient inference algorithms,” said Patro. “We are working on understanding and modeling an even larger array of potential technical biases that arise in RNA-seq-based gene expression studies. We are also particularly interested in how quantification algorithms can be made more accurate and robust in single-cell RNA-sequencing (scRNA-seq) experiments, which present unique algorithmic and statistical challenges.”
Salmon was developed with funding from the Gordon and Betty Moore Foundation’s Data-Driven Discovery Initiative, the National Science Foundation and the National Institutes of Health.
Source – Stony Brook University
Availability – Salmon is open source and freely licensed (GPLv3). It is written in C++11 and is available at https://github.com/COMBINE-lab/Salmon.