RNA sequencing (RNA-seq) is a common and widespread biological assay, and an increasing amount of data is generated with it. In practice, there are a large number of individual steps a researcher must perform before raw RNA-seq reads yield directly valuable information, such as differential gene expression data. Existing software tools are typically specialized, only performing one step– such as alignment of reads to a reference genome– of a larger workflow. The demand for a more comprehensive and reproducible workflow has led to the production of a number of publicly available RNA-seq pipelines. However, Johns Hopkins School of Medicine researchers have found that most require computational expertise to set up or share among several users, are not actively maintained, or lack features they have found to be important in their own analyses. In response to these concerns, the researchers have developed a Scalable Pipeline for Expression Analysis and Quantification (SPEAQeasy), which is easy to install and share, and provides a bridge towards R/Bioconductor downstream analysis solutions. SPEAQeasy is user-friendly and lowers the computational-domain entry barrier for biologists and clinicians to RNA-seq data processing as the main input file is a table with sample names and their corresponding FASTQ files. SPEAQeasy is portable across computational frameworks (SGE, SLURM, local, docker integration) and different configuration files are provided.
SPEAQeasy workflow diagram
A simplified workflow diagram for each pipeline execution. The red box indicates the FASTQ files are inputs to the pipeline; green coloring denotes major output files from the pipeline; the remaining boxes represent computational steps. Yellow-colored steps are optional or not always performed; for example, preparing a particular set of annotation files occurs once and uses a cache for further runs. Finally, blue-colored steps are ordinary processes which occur on every pipeline execution. The workflow proceeds downward, and each row in the diagram implicitly represents the ability for several computation steps to execute in parallel.
Availability – The SPEAQeasy code is available on GitHub https://github.com/LieberInstitute/SPEAQeasy and https://github.com/LieberInstitute/SPEAQeasy-example, and can be expanded through interactions with users.