With the advent of Next-Generation Sequencing (NGS) technologies, large volumes of data are generated every day. Analysis, however, remains a major hurdle to using the technology efficiently, because the data require complex multi-step processing and demand computational expertise from the user. A large number of algorithms, statistical methods, and software tools have been developed in recent years to perform the individual analysis steps of various NGS applications. The data analysis procedures of some NGS applications are therefore very complex, requiring several software tools for their various processing steps. As a result, there is a strong need for scalable computing environments that link the individual software components into automated workflows, so that complex genome-wide analyses can be conducted efficiently and reproducibly. Python currently lacks adequate general-purpose NGS workflow solutions. For theoretical and analytical scientists who use Python for NGS data processing, a workflow system that integrates NGS applications from within Python would therefore have many benefits.
To address this limitation, researchers at Utah State University have developed a Python package, pySeqRNA, which runs NGS data analysis from start to finish reproducibly and efficiently. The package provides a uniform workflow interface and supports running Python code and stand-alone tools on a High-Performance Computing Cluster (HPCC) as well as on local computers. The pipeline is flexible: it can handle complex experiments and samples, whether or not a reference genome is available. It is an extensible environment written in Python for performing end-to-end analysis, with automated report generation, for various NGS applications such as RNA-Seq, VAR-Seq, ChIP-Seq, Single Cell RNA-Seq, and dual RNA-Seq. To simplify the analysis of these applications, the package provides pre-configured analysis and report templates. More analysis templates will be added in the future.
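As a rough illustration of what a uniform interface over local and cluster execution can look like, the sketch below builds the command for a stand-alone tool either for direct local execution or wrapped for submission to a batch scheduler. It is not pySeqRNA's API: the function name `build_command`, the example tool invocation, and the choice of the SLURM scheduler (`sbatch --wrap`) are all assumptions made for illustration, since the abstract does not say which scheduler the package targets.

```python
import shlex


def build_command(tool_cmd, on_cluster=False):
    """Return the argument list to execute for one workflow step.

    Hypothetical helper, not part of pySeqRNA.
    tool_cmd   -- the stand-alone tool invocation as a list of arguments
    on_cluster -- if True, wrap the invocation for SLURM submission
                  (assumes a SLURM cluster; other schedulers differ)
    """
    if on_cluster:
        # sbatch --wrap takes the whole command as a single shell string
        return ["sbatch", "--wrap", shlex.join(tool_cmd)]
    # Local execution: run the tool directly
    return list(tool_cmd)


# Example tool invocation (illustrative file name)
step = ["fastqc", "sample.fastq.gz"]
local_cmd = build_command(step)
cluster_cmd = build_command(step, on_cluster=True)
```

Either list could then be passed to `subprocess.run`, so the calling workflow code stays the same regardless of where the step executes.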
The pySeqRNA workflow consists of the following steps: quality checking and pre-processing of raw sequence reads; accurate mapping of millions of short sequencing reads to a reference genome, including the identification of splicing events; quantification of the expression levels of genes, transcripts, and exons in three ways, namely (i) uniquely mapped reads, (ii) multi-mapped reads to the same gene, and (iii) multi-mapped groups; differential analysis of gene expression among different biological conditions; and biological interpretation of the differentially expressed genes, including functional enrichment analysis. The package accelerates the retrieval of reproducible results from NGS experiments. By integrating several command-line tools and custom Python scripts, it allows existing software to be used effectively alongside newly written Python scripts, without restricting users to a collection of pre-defined methods and environments.
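The three quantification modes can be sketched as a simple read classifier. This is a minimal illustration under stated assumptions, not pySeqRNA's implementation: the function names and input layout are hypothetical, and "multi-mapped group" is interpreted here as a read whose alignments fall on more than one gene, counted against the combined group of genes.

```python
from collections import Counter


def classify_read(genes):
    """Classify one read from the gene IDs of its alignments (hypothetical helper).

    Returns (category, key), where key is the gene or gene group to count.
    """
    if len(genes) == 1:
        return "unique", genes[0]              # (i) uniquely mapped read
    distinct = sorted(set(genes))
    if len(distinct) == 1:
        return "multi_same_gene", distinct[0]  # (ii) multi-mapped, same gene
    return "multi_group", "|".join(distinct)   # (iii) multi-mapped group (assumed meaning)


def count_reads(alignments):
    """Tally reads per category from a mapping of read ID to aligned gene IDs."""
    counts = {
        "unique": Counter(),
        "multi_same_gene": Counter(),
        "multi_group": Counter(),
    }
    for genes in alignments.values():
        if not genes:
            continue  # unmapped read, nothing to count
        category, key = classify_read(genes)
        counts[category][key] += 1
    return counts


# Toy input: read ID -> gene hit by each reported alignment
example = {
    "read1": ["geneA"],
    "read2": ["geneA", "geneA"],
    "read3": ["geneA", "geneB"],
    "read4": ["geneB"],
    "read5": ["geneB", "geneA"],
}
counts = count_reads(example)
```

Keeping the three tallies separate lets a downstream step decide whether to use only unique reads or to include multi-mapped reads when estimating expression.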
Poster presented at – 28th International Conference on Intelligent Systems for Molecular Biology (ISMB) 2020
Availability – pySeqRNA is freely available at http://bioinfo.usu.edu/pySeqRNA/.