RNA sequencing has become an increasingly affordable way to profile gene expression. Researchers from the LNCC, Brazil, have developed a scientific workflow that combines several open-source software tools and is executed by the Parsl parallel scripting library in a high-performance computing environment. The researchers applied the workflow to single-cardiomyocyte RNA-seq data retrieved from the Gene Expression Omnibus database. The workflow supports the analysis of raw RNA-seq data (alignment, quality control, read sorting and counting, and statistics generation) and seamlessly integrates the differential expression results into a configurable script.
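The per-sample structure described above can be illustrated with a minimal sketch. It deliberately uses only Python's standard library rather than Parsl itself, and it does not invoke any real bioinformatics tool: the stage names are taken from the abstract, their ordering is an assumption based on the order in which they are listed, and `run_stage`, `process_sample`, and `run_workflow` are hypothetical helpers introduced purely for illustration. The point is only that independent samples can run concurrently while the stages of each sample stay ordered.

```python
from concurrent.futures import ThreadPoolExecutor

# Stage names follow the abstract; the order is an assumption, and the
# stage body below is a placeholder that runs no real tool.
STAGES = ["align", "qc", "sort", "count"]


def run_stage(stage: str, upstream: str) -> str:
    """Pretend to run one stage and return a label for its output artifact."""
    return f"{upstream}>{stage}"


def process_sample(sample: str) -> str:
    """Run every stage for one sample strictly in order."""
    artifact = sample
    for stage in STAGES:
        artifact = run_stage(stage, artifact)
    return artifact


def run_workflow(samples: list[str], workers: int = 4) -> dict[str, str]:
    """Process independent samples concurrently; stages stay ordered per sample."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(process_sample, samples)
        return dict(zip(samples, results))


if __name__ == "__main__":
    # Hypothetical sample identifiers, used only to exercise the sketch.
    print(run_workflow(["sampleA", "sampleB"]))
```

In a workflow system such as Parsl, the same dependency pattern would typically be expressed by passing the output of one task as the input of the next, so the runtime can schedule independent samples in parallel.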
In this work, the researchers compare running the workflow on solid-state disk (SSD) storage and on the Lustre file system, a critical decision for improving execution efficiency and resilience in current and upcoming RNA-Seq workflows. By profiling CPU and I/O activity, they demonstrate that anomalies in transcriptomics workflow performance can be correctly identified, which is essential for optimizing the workflow's use of high-performance computing systems. ParslRNA-Seq improved total execution time by up to 70% compared with its previous sequential implementation. Finally, the researchers discuss which workflow modeling modifications lead to improved computational performance and scalability, based on provenance data.
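To put the reported figure in more familiar terms, the short sketch below converts a fractional reduction in execution time into a speedup factor. It assumes that "up to 70%" means the parallel run takes 70% less wall-clock time than the sequential one, which is an interpretation and is not stated explicitly in the text. The function name `speedup_from_reduction` is hypothetical.

```python
def speedup_from_reduction(reduction: float) -> float:
    """Convert a fractional time reduction into a speedup factor.

    If the new run time is T_new = (1 - reduction) * T_old, then
    speedup = T_old / T_new = 1 / (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in the range [0, 1)")
    return 1.0 / (1.0 - reduction)


if __name__ == "__main__":
    # Assumed reading of the reported figure: 70% less execution time.
    print(f"{speedup_from_reduction(0.70):.2f}x")
```

Under that reading, a 70% reduction corresponds to a speedup of roughly 3.3 times over the sequential implementation.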
Availability – ParslRNA-Seq is available at https://github.com/lucruzz/rna-seq