The latest project from a group of MU College of Engineering researchers could be a game changer in the world of biological research.
Jianlin Cheng, associate professor of computer science, and doctoral students Jilong Li and Jie Hou — with help from members of the MU Center for Botanical Interaction Studies, Division of Biological Sciences, Department of Chemistry, Department of Biochemistry, Informatics Institute and Bond Life Science Center — recently developed RNAMiner, a website making it easier for those in the biological sciences to analyze genomic and transcriptomic data.
The paper accompanying the website’s creation, “From Gigabyte to Kilobyte: A Bioinformatics Protocol for Mining Large RNA-Seq Transcriptomics Data,” appeared in the April 22 edition of PLoS One.
RNA-Seq is a technique that uses modern sequencing methods to study RNA (ribonucleic acid). It has increased the speed at which researchers can note differences in gene expression between genomes. Cheng saw a need to apply a Big Data approach to make mining these large, unwieldy RNA-Seq datasets faster and more user-friendly.
“This work actually started mainly by the demand of our scientists on campus. When I started here, I mostly worked on the protein structure function modeling and collaborated with many people like (the MU Center for Botanical Interaction Studies),” Cheng said.
“When analyzing the data, we found some general rules, general protocols that can really solve the problem arising from all those domains. Then we came up with this one paper to study how we really generate this kind of data and how that can be applicable to any species, any problem that produces data.”
The paper outlined five steps through which the team’s Big Data pipeline trims terabytes of transcriptomic data (data on the entire set of RNA transcripts a genome produces under specific conditions) down to just the essential data needed for a specific research project:
- mapping RNA-Seq reads to a reference genome,
- calculating gene expression values,
- identifying differentially expressed genes,
- predicting gene functions,
- constructing gene regulatory networks.
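To make the middle of that list concrete, here is a minimal Python sketch of simplified versions of steps two and three. It is not RNAMiner’s actual implementation: the read counts, gene names, function names and the fold-change threshold are all made up for illustration, and real differential expression tools rely on replicate samples and statistical tests rather than a bare cutoff.

```python
import math

# Hypothetical read counts per gene after mapping (step 1); numbers are made up.
counts = {
    "control": {"geneA": 100, "geneB": 500, "geneC": 400},
    "treated": {"geneA": 400, "geneB": 500, "geneC": 100},
}


def counts_per_million(sample):
    """Step 2 (simplified): scale raw counts by the sample's total read count."""
    total = sum(sample.values())
    return {gene: n * 1_000_000 / total for gene, n in sample.items()}


def log2_fold_changes(control, treated, pseudocount=1.0):
    """Step 3 (simplified): log2 ratio of treated to control expression.

    The pseudocount avoids division by zero for unexpressed genes.
    """
    return {
        gene: math.log2((treated[gene] + pseudocount) / (control[gene] + pseudocount))
        for gene in control
    }


def differentially_expressed(fold_changes, threshold=1.0):
    """Keep genes whose expression at least doubled or halved (assumed cutoff)."""
    return sorted(g for g, fc in fold_changes.items() if abs(fc) >= threshold)


control_cpm = counts_per_million(counts["control"])
treated_cpm = counts_per_million(counts["treated"])
fold_changes = log2_fold_changes(control_cpm, treated_cpm)
hits = differentially_expressed(fold_changes)
print(hits)  # ['geneA', 'geneC']
```

Even in this toy form, the shape of the reduction is visible: many raw counts go in, and a short list of candidate genes comes out.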
“Here is the raw data,” Cheng said. “Now we compress that basically hundreds of thousands of times, even one million times, to this level. Now they’re valuable. Now our collaborators can really take this info to identify the genes that cause diseases or certain traits of plants and do some experiments to verify their findings.”
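The “one million times” figure lines up with the paper’s title. As a rough check, assuming decimal units and a hypothetical one-gigabyte input reduced to a one-kilobyte result (illustrative sizes, not measurements from the project):

```python
GIGABYTE = 10**9  # bytes, decimal definition (assumed)
KILOBYTE = 10**3  # bytes

# How many times smaller is a kilobyte-scale result than a gigabyte-scale input?
reduction_factor = GIGABYTE // KILOBYTE
print(f"{reduction_factor:,}x")  # 1,000,000x
```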
The website was designed to be user-friendly, so biological researchers do not need a baseline level of computing skills to use it. The interface allows users to upload data and analyze it through as many of the five steps as they want. Genomes have been uploaded for five species so far: human, mouse, Drosophila melanogaster (a type of fly), Arabidopsis thaliana (TAIR10; a small flowering plant) and Clostridium perfringens (a type of bacterium). Genomic data for any species is welcome for upload to grow the database.
“To use our pipeline, you don’t have to know about those [computing] tools,” Li said. “You just need to upload those files and select several parameters, and it will automatically give those results.”
The funding for the project came from a National Institutes of Health Botanical Center grant, an NIH R01 grant and Cheng’s National Science Foundation CAREER Award. The website is free for anyone to use, and if further hands-on, human analysis is needed, Cheng’s research team can also collaborate with researchers to carry out the analysis. The RNAMiner web service is available at http://calla.rnet.missouri.edu/rnaminer/index.html.
On average, Hou said, two gigabytes of data takes approximately 10 hours for the servers to process and analyze.
“The only thing researchers have to do is upload the data and usually get results within a couple of hours,” Hou said.
Cheng said he’s been pleasantly surprised at the amount of feedback the team received in a relatively short period of time after unveiling the website. Given the possibilities of such a cost-effective, thorough resource, interest likely won’t wane anytime soon.
“This one is surprising. So many people are coming to us to submit data, inquire, both internally and externally. That’s something we never imagined, creating these bonds with this work,” he said.
Source – MU News