The availability of terabytes of RNA-Seq data and the continuous emergence of new analysis tools enable unprecedented biological insight. There is a pressing need for a framework that allows fast, efficient, manageable, and reproducible RNA-Seq analysis. Iowa State University researchers have developed a Python package, pyrpipe, that enables straightforward development of flexible, reproducible, and easy-to-debug computational pipelines purely in Python, in an object-oriented manner. pyrpipe provides access to popular RNA-Seq tools from within Python via high-level APIs. Pipelines can be customized by integrating new Python code, third-party programs, or Python libraries. Users can create checkpoints in the pipeline or integrate pyrpipe into a workflow management system, allowing execution in multiple computing environments and enabling efficient resource management. pyrpipe produces detailed analysis and benchmark reports, which can be shared or included in publications. pyrpipe is implemented in Python and is compatible with Python versions 3.6 and higher. To illustrate the rich functionality of pyrpipe, the developers provide case studies using RNA-Seq data from GTEx, SARS-CoV-2-infected human cells, and Zea mays.
The pyrpipe framework
This simple example illustrates the relationship between the Python code that the user writes for pyrpipe, the corresponding Python objects, the YAML parameter files, the corresponding shell script, and the output. The user need only define the NCBI-SRA Run Accessions and the tools to be used; the rest is automatic. Here, a single RNA-Seq run is specified; alternatively, thousands of runs could be processed. A key advantage of pyrpipe is that it can be used to easily create complex workflows that are intuitive, understandable, reproducible, and modifiable. pyrpipe can automatically load and resolve tool parameters from YAML files, which lets the user easily modify and document parameters. pyrpipe is represented by the green boxes. The user writes the code in Python (blue text), creating Python objects of specific pyrpipe classes that provide APIs to RNA-Seq tools. To execute the full pipeline, the user needs only to run the Python file, e.g. "python script.py --threads 10" to execute the pipeline using 10 threads (Box A). Each object encapsulates specific methods and data (Box B). For example, each SRA object stores the directory path for the associated raw RNA-Seq data; pyrpipe uses this path as the default directory for output files from the different RNA-Seq processing steps, i.e., trimming, alignment, assembly, or quantification. Tool parameters, if supplied in YAML files, are automatically loaded and stored in the corresponding pyrpipe object (Box C). During processing, shell commands are automatically constructed and executed by the pyrpipe APIs; pyrpipe outputs the full bash commands so that the user can easily monitor the status of the job (Box D). After execution, the pyrpipe_diagnostic tool generates extensive data analyses and diagnostic reports from the logs. These enable users to summarize, share, benchmark, or debug their pipelines (Box E).
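The pattern the caption describes (tool objects that hold parameters, a run object that holds a working directory, and shell commands built automatically from both) can be sketched in plain Python. This is only an illustration of the idea, not pyrpipe's implementation or API: the `Tool` and `Run` classes, the tool names, the flags, and the accession `SRR0000000` are all hypothetical, and a plain dict stands in for parameters that would be loaded from a YAML file. The commands are collected rather than executed.

```python
import shlex


class Tool:
    """Hypothetical tool wrapper (not a pyrpipe class)."""

    def __init__(self, name, params=None):
        self.name = name
        # Stand-in for parameters that would be loaded from a YAML file.
        self.params = dict(params or {})

    def build_command(self, input_path, **overrides):
        # Per-call overrides take precedence over the stored parameters.
        merged = {**self.params, **overrides}
        args = [self.name]
        for flag, value in merged.items():
            args.append(flag)
            if value is not None:
                args.append(str(value))
        args.append(input_path)
        # Quote each argument so the result is a safe shell command string.
        return " ".join(shlex.quote(a) for a in args)


class Run:
    """Hypothetical object tied to one run accession and its directory."""

    def __init__(self, accession, directory="."):
        self.accession = accession
        self.directory = f"{directory}/{accession}"
        self.current_file = f"{self.directory}/{accession}.fastq"
        self.commands = []  # log of every command that was constructed

    def apply(self, tool, out_suffix):
        # Build the command for the current file, then point at the new output.
        self.commands.append(tool.build_command(self.current_file))
        self.current_file = f"{self.directory}/{self.accession}{out_suffix}"
        return self  # returning self allows steps to be chained


trimmer = Tool("trim_tool", {"--quality": 20})   # hypothetical tool and flag
aligner = Tool("align_tool", {"--threads": 10})  # hypothetical tool and flag

run = (
    Run("SRR0000000", directory="data")          # placeholder accession
    .apply(trimmer, "_trimmed.fastq")
    .apply(aligner, ".bam")
)

for command in run.commands:
    print(command)
```

Keeping the constructed commands in a list mirrors the role of the logged bash commands in the caption: the same record can be used to monitor progress and, afterwards, to report on or debug the pipeline.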
Availability – All source code is freely available at https://github.com/urmi-21/pyrpipe; the package can be installed from source, from PyPI (https://pypi.org/project/pyrpipe), or from bioconda (https://anaconda.org/bioconda/pyrpipe). Documentation is available at http://pyrpipe.rtfd.io.