Decreasing costs make RNA sequencing technologies increasingly affordable for biologists. However, many researchers who can now afford sequencing lack access to resources necessary for downstream analysis. This means that even as algorithms to process RNA-Seq data improve, many biologists still struggle to manage the sheer volume of data produced by next generation sequencing (NGS) technologies. Scalable bioinformatics tools that exploit multiple platforms are needed to democratize bioinformatics resources in the sequencing era. This is essential for equipping many research groups in the life sciences with the tools to process the increasingly unwieldy datasets they produce.
One strategy to address this challenge is to develop a modern generation of sequence analysis tools capable of seamless data sharing and communication. Such tools will provide interoperability through offerings of interlinked resources. Systems of interlinked, scalable resources, which often incorporate cloud data storage, are broadly referred to as cyberinfrastructure. Cyberinfrastructure integrated tools will help researchers to robustly analyze large scale datasets by efficiently sharing data burdens across a distributed architecture. Additionally, interoperability will allow emerging tools to cross-adapt features of existing tools. It is important that these tools are designed to be easy to use for biologists.
Researchers at the Donald Danforth Plant Science Center have developed fRNAkenseq, a powered-by-CyVerse RNA sequencing analysis tool that exhibits interoperability with other resources and meets the needs of biologists for comprehensive, easy to use RNA sequencing analysis. fRNAkenseq leverages a complex set of Application Programming Interfaces (APIs) associated with the NSF-funded cyberinfrastructure project, CyVerse, to execute FASTQ-to-differential expression RNA-Seq analyses. Integrating across bioinformatics platforms, fRNAkenseq also exploits cloud integration and cross-talk with another CyVerse associated tool, CoGe. fRNAkenseq offers novel features for the biologist such as more robust and comprehensive pipelines for enrichment than those currently available by default in a single tool, whether they are cloud-based or local installation. Importantly, cross-talk with CoGe allows fRNAkenseq users to execute RNA-Seq pipelines on an inventory of 47,000 archived genomes stored in CoGe or upload their own draft genome.
Data lifecycle of fRNAkenseq job and schema illustrating how components
across CyVerse and CoGe relate to one another
(A) Outline of Agave App, (B) schema for MapCount, (C) schema for DiffExpress, (D) outline for connectivity with fRNAkenseq and CoGe as well as Data Store. User data uploaded into CyVerse is accessible for analysis by the fRNAkenseq front end. The path followed by user uploaded sequencing data during an analysis is illustrated in the green