Next-generation RNA-sequencing (RNA-Seq) is rapidly outcompeting microarrays as the technology of choice for whole-transcriptome studies. However, the bioinformatics skills required for RNA-Seq data analysis pose a significant hurdle for many biologists. Here, researchers at Utrecht University, The Netherlands, put forward the concepts and considerations that are critical for RNA-Seq data analysis and provide a generic tutorial with example data that outlines the whole pipeline from next-generation sequencing output to quantification of differential gene expression.

RNA-Seq

Van Verk MC, Hickman R, Pieterse CM, Van Wees SC. (2013) RNA-Seq: revelation of the messengers. Trends Plant Sci [Epub ahead of print]. [abstract]


ArrayStudio is GUI software for Windows. OmicSoft also provides a command-line tool named Oshell. Oshell.exe is a .NET application, and it has also been optimized for the Linux environment using Mono. In this article, I will describe a complete OmicScript pipeline for RNA-Seq data analysis.

OmicSoft

OmicSoft is available at http://www.arrayserver.com/wiki/index.php?title=OmicScript_pipeline_for_RNA-Seq_data_analysis


Forward genetic screens in model organisms are vital for identifying novel genes essential for developmental or disease processes. One drawback of these screens is the labor-intensive and sometimes inconclusive process of mapping the causative mutation.

To leverage high-throughput techniques to improve this mapping process, scientists at the University of Utah have developed a Mutation Mapping Analysis Pipeline for Pooled RNA-seq (MMAPPR) that works without parental strain information, does not require a pre-existing SNP map of the organism, and adapts to differential recombination frequencies across the genome. MMAPPR accommodates the considerable amount of noise in RNA-seq datasets, calculates allelic frequency by Euclidean distance followed by Loess regression analysis, identifies the region where the mutation lies, and generates a list of putative coding region mutations in the linked genomic segment. MMAPPR can exploit RNA-seq datasets from isolated tissues or whole organisms that are used for gene expression and transcriptome analysis in novel mutants.
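The "Euclidean distance followed by Loess regression" step can be illustrated with a minimal Python sketch. This is not MMAPPR's actual code: the function names, the A/C/G/T count layout, the `span` parameter and the simplified tricube-weighted local linear fit are all assumptions made for illustration only.

```python
import math

def allele_distance(mut_counts, wt_counts):
    """Euclidean distance between the allele-frequency vectors of two pools.

    Each argument holds read counts per allele (here assumed A, C, G, T)
    at a single genomic position. Returns None if either pool has no reads.
    """
    mut_total = sum(mut_counts)
    wt_total = sum(wt_counts)
    if mut_total == 0 or wt_total == 0:
        return None
    return math.sqrt(sum((m / mut_total - w / wt_total) ** 2
                         for m, w in zip(mut_counts, wt_counts)))

def loess_point(x0, xs, ys, span=0.5):
    """Simplified Loess: weighted local linear fit evaluated at x0.

    Uses tricube weights over the nearest `span` fraction of the points.
    """
    k = max(2, math.ceil(span * len(xs)))
    h = sorted(abs(x - x0) for x in xs)[k - 1] or 1.0  # bandwidth
    w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
    sw = sum(w)
    if sw == 0:  # degenerate case: fall back to the plain mean
        return sum(ys) / len(ys)
    mx = sum(wi * x for wi, x in zip(w, xs)) / sw
    my = sum(wi * y for wi, y in zip(w, ys)) / sw
    sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
    if sxx == 0:  # only one point carries weight
        return my
    slope = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys)) / sxx
    return my + slope * (x0 - mx)

# Hypothetical positions (bp) and per-position pool counts (A, C, G, T).
positions = [100, 200, 300, 400, 500]
mutant = [(10, 10, 0, 0), (14, 6, 0, 0), (20, 0, 0, 0), (15, 5, 0, 0), (11, 9, 0, 0)]
wild_type = [(10, 10, 0, 0)] * 5
distances = [allele_distance(m, w) for m, w in zip(mutant, wild_type)]
smoothed = [loess_point(p, positions, distances, span=1.0) for p in positions]
```

In this toy data the smoothed distance is largest around the middle position, where the mutant pool is homozygous, which is the kind of peak a mapping pipeline would look for.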

The researchers tested MMAPPR on two known mutant lines in zebrafish, nkx2.5 and tbx1, and used it to map two novel ENU-induced cardiovascular mutants, with mutations found in ctr9 and cds2. MMAPPR can be directly applied to other model organisms, such as Drosophila and C. elegans, that are amenable to both forward genetic screens and pooled RNA-seq experiments. Thus, MMAPPR is a rapid, cost-efficient, and highly automated pipeline, available to perform mutant mapping in any organism with a well-assembled genome.

MMAPPR

Availability – MMAPPR is available at: http://yost.genetics.utah.edu/software.php

  • Hill JT, Demarest BL, Bisgrove BW, Gorsi B, Su YC, Yost HJ. (2013)  MMAPPR: Mutation Mapping Analysis Pipeline for Pooled RNA-seq. Genome Res [Epub ahead of print]. [abstract]


RNA-Seq

The measurement of RNA expression is a foundation of many experiments done in biomedical research. It is therefore natural that the sequencing of long and short RNA, both for quantification and discovery, is the most popular functional sequencing assay (Fig. 1a). Quantification of mRNA transcripts in RNA-seq is performed by calculating, for each gene, values reported in units of reads per kilobase per million mapped reads (RPKM); the paired-end equivalent, fragments per kilobase per million reads (FPKM), is also commonly used (Fig. 1b). RPKM normalizes for differences in gene size and makes the comparison of genes within the same sample meaningful in terms of molar equivalents (Fig. 1b). As RNA-seq is not based on predetermined DNA probes to known genes, it is a powerful tool for the discovery of new exons, splice junctions, transcripts and genes as well as new small RNAs (Fig. 1c). The reads can be used to assemble transcripts that result from gene rearrangements and can also help to identify disease-associated genomic abnormalities. Properly filtered RNA-seq reads can be mined for sequence variants and RNA-editing events with tuned analysis and filtering pipelines (Fig. 1d).
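As a worked illustration of the RPKM unit described above, here is a minimal Python sketch. The function name and the example read counts and gene lengths are hypothetical; FPKM follows the same formula with fragments counted instead of reads.

```python
def rpkm(read_count, gene_length_bp, total_mapped_reads):
    """Reads per kilobase of gene per million mapped reads.

    Equivalent to read_count / (gene_length_bp / 1e3) / (total_mapped_reads / 1e6).
    """
    return read_count * 1e9 / (gene_length_bp * total_mapped_reads)

# Hypothetical sample with 20 million mapped reads.
total = 20_000_000
gene_a = rpkm(1000, 4000, total)  # long gene with many reads
gene_b = rpkm(500, 1000, total)   # short gene with fewer reads
print(gene_a, gene_b)  # prints 12.5 25.0
```

Although gene A collects twice as many reads as gene B, its RPKM is half as large, because the length normalization accounts for gene A being four times longer.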

  • Zeng W, Mortazavi A. (2012) Technical considerations for functional sequencing assays. Nat Immunol 13(9), 802-7. [abstract]


In our last reader poll, we asked: Do we yet have a firm handle on the most appropriate/accurate pipeline for analysis of RNA-Seq datasets?

The overwhelming result was: NO. N=72

We’re now hearing feedback from scientists developing data analysis methods for RNA-Seq. We would very much like to hear from more of you. Please send your comments to contribute@rna-seqblog.com.

by Dr. Raffaele A. Calogero, Bioinformatics and Genomics Unit,  MBC Centro di Biotecnologie Molecolari, Torino, Italy

Concerning the poll results, I perfectly understand the frustration of the researchers. We started to optimize a miRNA-seq pipeline because we did not want to lose information simply because we had not used the right combination of tools in the analysis pipeline. We simply applied to miRNA-seq the same approach used in the past for microarray data analysis.

Although RNA-seq seems to be an extremely powerful technique, only now are we getting information on its biases and critical points. This period is very similar to the early days of microarray data analysis: a lot of excitement, a lot of new tools, but very little knowledge of the effect of integrating tools in a pipeline.

In my opinion, a keyword for pipeline optimization is “benchmark dataset”. An important step in the analysis of Affymetrix 3′ IVT arrays came from the availability of spike-in experiments, which allowed the optimization of normalization/summarization algorithms (Irizarry et al. PMID:18676452). Subsequently, similar approaches were used for exon arrays (Abdueva et al. PMID:17878948; Della Beffa et al. PMID:19040723). In our paper, we also took advantage of a miRNA-seq spike-in experiment (Willenbrock et al. PMID:19745027).

On the basis of previous work, it is evident that we desperately need experimental benchmark datasets to optimize RNA-seq pipelines. Some good work in this direction has been done by Jiang et al. (PMID:21816910). However, I find it very hard to think of experimental spike-in datasets that will be able to cover all possible steps of an RNA-seq pipeline for isoform discovery or quantification. Maybe a combination of experimental spike-in data with synthetic data (http://cbil.upenn.edu/BEERS/) might represent the way to evaluate the strengths and limits of mRNA-seq pipelines.


Optimized microRNA differential expression analysis workflow for digital data

Massive Parallel Sequencing (MPS) methods can extend and improve the knowledge obtained by conventional microarray technology, both for mRNAs and short non-coding RNAs, e.g. miRNAs. The processing methods used to extract and interpret the information are an important aspect of dealing with the vast amounts of data generated from short read sequencing. Although the number of computational tools for MPS data analysis is constantly growing, their strengths and weaknesses as part of a complex analytical pipeline have not yet been well investigated.

Researchers at the Department of Computer Sciences, University of Torino, Italy set out to define a clear, simple and optimized analytical workflow for the digital quantitative analysis of miRNAs.

They merged a publicly available MPS spike-in miRNAs data set with MPS data derived from healthy donor peripheral blood mononuclear cells to assemble a benchmark MPS miRNA dataset, resembling a situation in which miRNAs are spiked in biological replication experiments.

They observed that short-read counts are strongly underestimated for duplicated miRNAs if the whole genome is used as the reference. Furthermore, the sensitivity of miRNA detection depends strongly on the primary tool used in the analysis. Among the six aligners tested for miRNA detection, SHRiMP and MicroRazerS show the highest sensitivity. Differential expression estimation is quite efficient. Among the five tools investigated, two (DESeq and baySeq) show very good specificity and sensitivity in the detection of differential expression.
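To make the sensitivity and specificity figures concrete, the following Python sketch shows how they could be computed when the truly changed miRNAs are known, as in a spike-in benchmark. The function name and the miRNA identifiers are hypothetical and are not taken from the paper.

```python
def sensitivity_specificity(called, truly_changed, all_features):
    """Compare differential-expression calls against a known ground truth.

    called        -- features a tool reports as differentially expressed
    truly_changed -- features known to change (e.g. spiked-in miRNAs)
    all_features  -- every feature that was tested
    """
    called, truth, universe = set(called), set(truly_changed), set(all_features)
    tp = len(called & truth)             # true positives
    fn = len(truth - called)             # false negatives
    fp = len(called - truth)             # false positives
    tn = len(universe - called - truth)  # true negatives
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec

# Hypothetical benchmark: 10 miRNAs tested, 4 of them spiked in at different levels.
features = [f"mir{i}" for i in range(1, 11)]
truth = ["mir1", "mir2", "mir3", "mir4"]
calls = ["mir1", "mir2", "mir3", "mir7"]
sens, spec = sensitivity_specificity(calls, truth, features)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Here three of the four spiked-in miRNAs are recovered (sensitivity 0.75), and one of the six unchanged miRNAs is wrongly called (specificity 5/6).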

  • Cordero F, Beccuti M, Arigoni M, Donatelli S, Calogero RA. (2012) Optimizing a Massive Parallel Sequencing Workflow for Quantitative miRNA Expression Analysis. PLoS One 7(2), e31630. [article]


Currently available analysis tools are often not easily installed by the general biologist and most of them lack inherent parallel processing capabilities widely recognized as an essential feature of next-generation bioinformatics tools. Presented here is a user-friendly and fully automated RNA-Seq analysis pipeline (R-SAP) with built-in multi-threading capability to analyze and quantitate high-throughput RNA-Seq datasets. R-SAP follows a hierarchical decision making procedure to accurately characterize various classes of transcripts and achieves a near linear decrease in data processing time as a result of increased multi-threading. In addition, RNA expression level estimates obtained using R-SAP display high concordance with levels measured by microarrays.

The R-SAP program is publicly available at www.mcdonaldlab.biology.gatech.edu/r-sap.htm.

  • Mittal VK, McDonald JF. (2012) R-SAP: a multi-threading computational pipeline for the characterization of high-throughput RNA-sequencing data. Nucleic Acids Res [Epub ahead of print]. [article]
