Computer Scientists at WPI are developing tools to make sense of RNA sequencing data


With an award from the National Institutes of Health, a team led by Dmitry Korkin will develop next-generation machine learning algorithms that could advance our understanding of the molecular biology of disease and the field of personalized medicine

A team of computer scientists at Worcester Polytechnic Institute (WPI) has received a two-year, $347,000 award from the National Institutes of Health to develop and evaluate new computational techniques that will provide a better understanding of the genetic and molecular interactions that underpin complex diseases. For example, the tools will help predict the likelihood that specific genetic mutations or patterns of mutations will lead to diabetes, neurological disorders, cancer, and other maladies; the likely outcome of those diseases; and how well those conditions will respond to treatment.

Led by Dmitry Korkin, associate professor of computer science and director of WPI’s Bioinformatics and Computational Biology Program, the team will develop tools for sifting through the vast amount of data that next-generation sequencing techniques now produce about genetic mutations linked to various diseases and about the alternative gene products that occur in diseased tissues. The goal is a deeper understanding of the complex interactions of genes, RNA molecules, and proteins within cells that ultimately shape the onset and progress of diseases.

“The more we learn about how complex diseases work at the molecular level,” Korkin said, “the more we come to appreciate the intricate web of molecular interactions that are key to why one person gets sick, while another with similar mutations does not, or why one person’s cancer responds to chemotherapy, while another is unaffected. Studying these complex interaction networks in the laboratory with high-throughput techniques is extremely time consuming and expensive, which is why our understanding of these networks is very limited.”

Korkin says existing big data tools are limited in their ability to model complex biological networks and how they change in a disease state. And while the hope is that such tools could one day replace laboratory experiments, or at least help scientists determine which experiments are likely to yield the most useful results, they are not yet up to that task. With the NIH award, Korkin said he hopes to bridge that gap by developing new kinds of computational methods that draw on an area of artificial intelligence known as machine learning.

He said the goal is to better model the complex web of molecular interactions within cells that begins when genes, which contain the genetic code for making proteins, are transcribed into RNA molecules. RNA, in turn, transfers the genetic information to machinery within the cell that uses it to assemble specific proteins. Finally, the myriad proteins produced within the cell interact in an intricate molecular ballet. In particular, Korkin said his aim is to create algorithms that can predict how this dense web of interactions changes when the genes develop mutations. These insights could help lay the foundation for personalized medicine, in which physicians will have the tools to predict the likely course of a particular disease in individual patients and prescribe individualized treatments.

While modelling the interactions of protein networks is challenging enough, Korkin said another emerging concept in molecular biology adds a new layer of complexity. It’s called alternative splicing, and it expands on the classic model of genetics, in which the information coded into a single gene represents the blueprint for one and only one protein. It is now known that most genes are capable of producing multiple proteins, depending on which sections of the gene are transcribed by RNA molecules. Cells use a variety of regulatory mechanisms to determine which proteins will be produced at any one moment, but various diseases, including cancer, can also change the way a gene is “spliced.”

“Alternative splicing is now seen as one of the cell’s most important regulatory mechanisms,” Korkin said. “With each gene potentially able to produce five, six, or even dozens of different proteins, understanding how these variations affect cell function could be a very powerful tool for biology and medicine, since alternative splicing appears to be a more powerful mechanism for bringing about profound changes in the cell than mutations or alterations in gene expression.”

Korkin says the computational tools his group will develop with the new NIH award will also be able to account for the effects of alternative splicing, and the knowledge gained with those tools could advance our understanding of biology and improve healthcare. For example, he said it is believed that some genes produce different proteins in different tissues, at different times of day, or under different environmental stressors. It is also believed that the genes in tumors may express different proteins at different pathological stages. Having a tool that can predict these alterations could significantly enhance how diseases are diagnosed and how new treatments are developed and administered, he said.

In a recent paper in the journal RNA (“Biological classification with RNA-Seq data: Can alternatively spliced transcript expression enhance machine learning classifier?”), Korkin and his colleagues tested whether machine learning tools that use data about alternative splicing perform better than tools that rely on data about gene expression (or data about how the genetic code is translated into proteins). In the paper, the challenge presented to the algorithms was to take molecular data about tissue samples and identify the tissue types, the age and gender of the individuals from whom the samples were taken, whether the tissues were healthy or cancerous, and the pathological stages of individual tumors.
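The kind of comparison described above can be illustrated with a toy sketch. Nothing here comes from the paper: the data are synthetic, the two-isoform gene is invented, and the nearest-centroid classifier and the helper names (`simulate_sample`, `to_gene_level`, `compare`) are assumptions chosen only to keep the example short. It shows one way transcript-level features can carry a signal that gene-level totals hide: the two classes differ in the ratio of the isoforms but not in their sum.

```python
import random

def simulate_sample(label, rng):
    """One hypothetical gene with two isoforms (synthetic data).
    The class shifts the isoform ratio but not the total expression."""
    total = rng.gauss(100.0, 10.0)  # gene-level expression, same for both classes
    frac_a = (0.8 if label == 1 else 0.2) + rng.gauss(0.0, 0.05)
    iso_a = total * frac_a
    return [iso_a, total - iso_a]

def make_dataset(n, rng):
    labels = [i % 2 for i in range(n)]  # balanced classes 0 and 1
    rows = [simulate_sample(lab, rng) for lab in labels]
    return rows, labels

def to_gene_level(isoform_rows):
    # Collapse isoform expression into a single per-gene total.
    return [[sum(row)] for row in isoform_rows]

def fit_centroids(rows, labels):
    # Nearest-centroid "training": the mean feature vector of each class.
    centroids = {}
    for lab in set(labels):
        members = [r for r, l in zip(rows, labels) if l == lab]
        centroids[lab] = [sum(col) / len(members) for col in zip(*members)]
    return centroids

def predict(centroids, row):
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(row, center))
    return min(centroids, key=lambda lab: sq_dist(centroids[lab]))

def accuracy(centroids, rows, labels):
    hits = sum(predict(centroids, r) == l for r, l in zip(rows, labels))
    return hits / len(labels)

def compare(seed=0, n_train=200, n_test=200):
    """Train and test the same classifier on isoform-level and gene-level features."""
    rng = random.Random(seed)
    train_iso, train_y = make_dataset(n_train, rng)
    test_iso, test_y = make_dataset(n_test, rng)
    iso_model = fit_centroids(train_iso, train_y)
    gene_model = fit_centroids(to_gene_level(train_iso), train_y)
    return {
        "isoform_level": accuracy(iso_model, test_iso, test_y),
        "gene_level": accuracy(gene_model, to_gene_level(test_iso), test_y),
    }

if __name__ == "__main__":
    print(compare())
```

On this synthetic data the gene-level classifier should hover near chance, because the totals are drawn from the same distribution for both classes, while the isoform-level classifier separates them easily. Real RNA-Seq data involve thousands of genes and far noisier signals, so the sketch illustrates the idea rather than the study's method or results.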

They found that in virtually every case, the alternative splicing data was better at classifying the samples than gene expression data, and that in many cases it produced classifications with 100 percent accuracy. “Whether we looked at tissue-specific effects, developmental changes, or disease stage data, everything was classified with greater accuracy by using the alternative splicing data,” Korkin said.

“With the new NIH award, we will have the resources to take our machine learning tools to the next level and help contribute to the most exciting emerging areas of biology and personalized medicine.”

Source – Worcester Polytechnic Institute

Johnson NT, Dhroso A, Hughes KJ, Korkin D. (2018) Biological classification with RNA-Seq data: Can alternatively spliced transcript expression enhance machine learning classifier? RNA [Epub ahead of print].
