RNA-Seq Data Analysis – Where To Start?

In our last reader poll, we asked: Do we yet have a firm handle on the most appropriate/accurate pipeline for analysis of RNA-Seq datasets?

The overwhelming result was: NO. N=72

We’re now hearing feedback from scientists developing data analysis methods for RNA-Seq. We would very much like to hear from more of you. Please send your comments to contribute@rna-seqblog.com.

by Dr. Raffaele A. Calogero, Bioinformatics and Genomics Unit,  MBC Centro di Biotecnologie Molecolari, Torino, Italy

Concerning the poll results, I perfectly understand the researchers' frustration. We started to optimize a miRNA-seq pipeline because we did not want to lose information simply by failing to use the right combination of tools in the analysis pipeline. We simply applied to miRNA-seq the same approach used in the past for microarray data analysis.

Although RNA-seq seems to be an extremely powerful technique, only now are we getting information on its biases and critical issues. The situation is very similar to the early days of microarray data analysis: a lot of excitement, a lot of new tools, but very little knowledge of how combining tools in a pipeline affects the results.

In my opinion, a key concept for pipeline optimization is the “benchmark dataset”. An important step in the analysis of Affymetrix 3’ IVT arrays came from the availability of spike-in experiments, which allowed the optimization of normalization/summarization algorithms (Irizarry et al. PMID:18676452). Subsequently, similar approaches were used for exon arrays (Abdueva et al. PMID:17878948; Della Beffa et al. PMID:19040723). In our paper, too, we took advantage of a miRNA-seq spike-in experiment (Willenbrock et al. PMID:19745027).

On the basis of previous work, it is evident that we desperately need experimental benchmark datasets to optimize RNA-seq pipelines. Some good work in this direction has been done by Jiang et al. (PMID:21816910). However, I find it very hard to imagine experimental spike-in datasets that could cover all possible steps of an RNA-seq pipeline for isoform discovery or quantification. Perhaps a combination of experimental spike-ins with synthetic data (http://cbil.upenn.edu/BEERS/) is the way to evaluate the strengths and limits of mRNA-seq pipelines.
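
To make the idea of a benchmark dataset concrete, here is a minimal sketch of how spike-ins can rank pipelines: since the amount of each spike-in is known, one can ask how well each pipeline's estimates track those amounts. The pipeline names, the numbers, and the choice of a log2 Pearson correlation as the score are all assumptions made for illustration; they are not taken from any of the papers cited above, and a real evaluation would use more than one metric.

```python
import math

def log2_pearson(known, estimated, pseudocount=1.0):
    """Pearson correlation between log2 known amounts and log2 estimates."""
    # The pseudocount (an illustrative choice) avoids log2(0) for undetected spike-ins.
    x = [math.log2(k + pseudocount) for k in known]
    y = [math.log2(e + pseudocount) for e in estimated]
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical spike-in amounts (arbitrary units) added to the sample.
spike_in_amounts = [1, 4, 16, 64, 256, 1024]

# Hypothetical expression estimates for the same spike-ins from two pipelines.
pipeline_estimates = {
    "pipeline_A": [3, 9, 30, 130, 500, 2100],
    "pipeline_B": [40, 35, 60, 50, 300, 900],
}

# Score each pipeline by how closely it follows the known amounts.
scores = {
    name: log2_pearson(spike_in_amounts, estimates)
    for name, estimates in pipeline_estimates.items()
}
best = max(scores, key=scores.get)

for name, score in sorted(scores.items()):
    print(f"{name}: log2 Pearson r = {score:.3f}")
print("Best on this benchmark:", best)
```

A score like this only covers quantification of the spiked molecules. It says nothing about steps such as isoform discovery, which is why synthetic data may be needed alongside experimental spike-ins.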