Recent advances in high-throughput cDNA sequencing (RNA-seq) can reveal new genes and splice variants and quantify expression genome-wide in a single assay. The volume and complexity of data from RNA-seq experiments necessitate scalable, fast and mathematically principled analysis software. TopHat and Cufflinks are free, open-source software tools for gene discovery and comprehensive expression analysis of high-throughput mRNA sequencing (RNA-seq) data. Together, they allow biologists to identify new genes and new splice variants of known ones, as well as compare gene and transcript expression under two or more conditions.

This protocol describes in detail how to use TopHat and Cufflinks to perform such analyses. It also covers several accessory tools and utilities that aid in managing data, including CummeRbund, a tool for visualizing RNA-seq analysis results. Although the procedure assumes basic informatics skills, these tools assume little to no background with RNA-seq analysis and are meant for novices and experts alike. The protocol begins with raw sequencing reads and produces a transcriptome assembly, lists of differentially expressed and regulated genes and transcripts, and publication-quality visualizations of analysis results. The protocol’s execution time depends on the volume of transcriptome sequencing data and available computing resources but takes less than 1 d of computer time for typical experiments and ~1 h of hands-on time.

  • Trapnell C, Roberts A, Goff L, Pertea G, Kim D, Kelley DR, Pimentel H, Salzberg SL, Rinn JL, Pachter L. (2012) Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks. Nat Protoc 7(3), 562-78. [article]

Comments

4 Responses to “Protocol – Differential gene and transcript expression analysis of RNA-seq experiments with TopHat and Cufflinks”

  1. yu on March 8th, 2012 3:26 pm

    A nice paper, but:
    1: six FASTQ files are used as examples, but TopHat cannot process C1_R2_* and C2_R2_*;
    2: there is a typo in the cuffdiff command;
    3: I got an error message when running cummeRbund.

    I emailed the author; no response yet.

    > cuff_data <- readCufflinks('diff_out')
    Creating database diff_out/cuffData.db
    Reading diff_out/genes.fpkm_tracking
    Checking samples table…
    Populating samples table…
    Writing genes table
    Reshaping geneData table
    Recasting
    Writing geneData table
    Error in sqliteExecStatement(con, statement, bind.data) :
    RS-DBI driver: (unable to bind data for parameter ':status')

  2. yu on March 10th, 2012 10:46 am

    Well, I got a response from the author:
    My first point is right. The latest version of TopHat cannot handle C1_R2_* and C2_R2_*.

    My other two points are not correct.
    There is NO typo in the cuffdiff command in their paper. You have to be very careful with the commas between the file names.

  3. colin on March 20th, 2012 4:25 am

    I have the same error when using cummeRbund. Did you get an answer from the authors on that point, or did you manage to make it work?
    Thx

  4. yu on March 25th, 2012 9:51 am

    With help from the author, I figured it out:

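    The error from cummeRbund is due to incorrect usage of cuffdiff.
    Be careful with the cuffdiff command, especially the commas: replicate BAM files from the same condition are joined by commas (no spaces), and the conditions themselves are separated by spaces.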

    Wrong one:
    cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam ./C1_R3_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam ./C2_R3_thout/accepted_hits.bam

    Correct one:
    cuffdiff -o diff_out -b genome.fa -p 8 -L C1,C2 -u merged_asm/merged.gtf ./C1_R1_thout/accepted_hits.bam,./C1_R3_thout/accepted_hits.bam ./C2_R1_thout/accepted_hits.bam,./C2_R3_thout/accepted_hits.bam
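The comma rule discussed in the comments can be sketched as a small Python helper that assembles the cuffdiff argument list, so that replicates are always comma-joined within a condition and conditions are passed as separate arguments. This is only an illustrative sketch: the function name `build_cuffdiff_args` and its parameters are hypothetical, and the only cuffdiff options used are the ones that already appear in the command quoted above (`-o`, `-b`, `-p`, `-L`, `-u`).

```python
def build_cuffdiff_args(out_dir, genome_fasta, merged_gtf, conditions, threads=8):
    """Build a cuffdiff argument list (hypothetical helper, not part of Cufflinks).

    conditions: dict mapping a condition label to its list of replicate
    BAM paths, in the order the conditions should be compared.
    """
    if len(conditions) < 2:
        raise ValueError("need at least two conditions to compare")
    for label, bams in conditions.items():
        if not bams:
            raise ValueError(f"condition {label!r} has no BAM files")
        for bam in bams:
            # A comma inside a path would be read as a replicate separator.
            if "," in bam:
                raise ValueError(f"BAM path must not contain a comma: {bam!r}")

    args = [
        "cuffdiff",
        "-o", out_dir,
        "-b", genome_fasta,
        "-p", str(threads),
        "-L", ",".join(conditions),  # labels: one comma-separated argument
        "-u", merged_gtf,
    ]
    # One argument per condition; replicates within it are comma-joined.
    args += [",".join(bams) for bams in conditions.values()]
    return args


example = build_cuffdiff_args(
    "diff_out",
    "genome.fa",
    "merged_asm/merged.gtf",
    {
        "C1": ["./C1_R1_thout/accepted_hits.bam", "./C1_R3_thout/accepted_hits.bam"],
        "C2": ["./C2_R1_thout/accepted_hits.bam", "./C2_R3_thout/accepted_hits.bam"],
    },
)
print(" ".join(example))
```

Printing the joined list should reproduce the "Correct one" command from the comment. The list could also be handed to `subprocess.run(example, check=True)` on a machine where cuffdiff is installed; passing a list rather than a shell string avoids accidental word splitting.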
