An open RNA-Seq data analysis pipeline/tutorial with an example of reprocessing data from a recent Zika virus study

RNA-seq analysis is becoming a standard method for global gene expression profiling. However, running open, standardized RNA-seq pipelines remains challenging for non-experts because of the large size of the raw data files and the hardware requirements of the alignment step.

Here, researchers from the Icahn School of Medicine at Mount Sinai introduce a reproducible, open-source RNA-seq pipeline delivered as an IPython notebook and a Docker image. The pipeline uses state-of-the-art tools and can run on various platforms with minimal configuration overhead. It enables the extraction of knowledge from typical RNA-seq studies by generating interactive principal component analysis (PCA) and hierarchical clustering (HC) plots, performing enrichment analyses against over 90 gene set libraries, and obtaining lists of small molecules that are predicted to either mimic or reverse the observed changes in mRNA expression.


Workflow of the different steps carried out in the pipeline

The researchers apply the pipeline to a recently published RNA-seq dataset collected from human neuronal progenitors infected with the Zika virus (ZIKV). In addition to confirming the presence of cell cycle genes among the genes that are downregulated by ZIKV, their analysis finds that the genes upregulated by ZIKV overlap significantly with genes whose knockout in mice induces defects in brain morphology. This result potentially points to the molecular processes associated with the microcephaly phenotype observed in newborns of mothers infected with the virus during pregnancy. In addition, their analysis predicts small molecules that can either mimic or reverse the expression changes induced by ZIKV.
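To make the idea of a "significant overlap" between two gene lists concrete, here is a minimal sketch of a one-sided hypergeometric test written with the Python standard library only. This is one common way to score such an overlap; whether the pipeline uses exactly this statistic is an assumption, and the function name `overlap_pvalue` and the `background_size` parameter are hypothetical names chosen for illustration.

```python
from math import comb


def overlap_pvalue(query_genes, library_genes, background_size):
    """Return (overlap size, p-value) for the overlap of two gene lists.

    The p-value is the probability of seeing at least the observed overlap
    when len(query_genes) genes are drawn at random, without replacement,
    from a background of `background_size` genes that contains the library
    gene set (one-sided hypergeometric test).
    """
    query, library = set(query_genes), set(library_genes)
    k = len(query & library)          # observed overlap
    n, K, N = len(query), len(library), background_size
    if n > N or K > N:
        raise ValueError("background_size must cover both gene lists")
    total = comb(N, n)                # number of possible query draws
    # Sum the probabilities of every overlap at least as large as k.
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return k, tail / total
```

A call such as `overlap_pvalue(upregulated, knockout_phenotype_genes, 20000)` would return the number of shared genes together with the p-value, where both list names and the background of 20,000 genes are placeholders. When many gene set libraries are tested, the resulting p-values would normally be corrected for multiple testing.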


Hierarchical clustering heatmap of the 800 genes with the largest variance. The counts per million (CPM) values of the 800 genes with the largest variance across the eight samples were log-transformed and z-score normalized across samples. Blue indicates low expression and red indicates high expression.
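The preprocessing described in the heatmap caption can be sketched in a few lines of standard-library Python. This is an illustrative sketch, not the notebook's actual code: the function names are hypothetical, and the log base (log2), the pseudocount of 1, and ranking genes by the variance of their CPM values are assumptions the caption does not specify.

```python
import math
from statistics import mean, pstdev, pvariance


def counts_to_cpm(counts):
    """Convert raw counts (dict: gene -> list of per-sample counts) to CPM."""
    n_samples = len(next(iter(counts.values())))
    # Library size = total reads counted in each sample.
    lib_sizes = [sum(row[j] for row in counts.values()) for j in range(n_samples)]
    return {g: [row[j] / lib_sizes[j] * 1e6 for j in range(n_samples)]
            for g, row in counts.items()}


def top_variance_genes(expr, n_top):
    """Return the n_top genes with the largest variance across samples."""
    return sorted(expr, key=lambda g: pvariance(expr[g]), reverse=True)[:n_top]


def log_zscore(expr, genes, pseudocount=1.0):
    """Log2-transform the selected genes, then z-score each gene across samples."""
    result = {}
    for g in genes:
        logged = [math.log2(v + pseudocount) for v in expr[g]]
        mu, sd = mean(logged), pstdev(logged)
        # A gene with no variation gets all zeros instead of dividing by zero.
        result[g] = [(v - mu) / sd if sd > 0 else 0.0 for v in logged]
    return result
```

With a count table for the eight samples, `log_zscore(cpm, top_variance_genes(cpm, 800))` would give the matrix that a clustering heatmap displays, with negative values drawn in blue and positive values in red.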

Availability – The IPython notebook, as well as other scripts and data files for this tutorial are available on GitHub at:, doi:

Wang Z, Ma’ayan A. (2016) An open RNA-Seq data analysis pipeline tutorial with an example of reprocessing data from a recent Zika virus study. F1000 Research [Epub ahead of print]. [article]
