There is a lot of excitement and buzz surrounding RNA-Seq’s forecasted replacement of microarrays for gene expression analysis. (I wonder… who could be generating this hype?) From speaking to those interested in RNA-Seq for gene expression profiling, it seems there is something of a frenzy, and a notion that RNA-Seq has some kind of magical power, so that the normal rules of good experimental design and practice (e.g. the need for replicate samples) don’t seem to apply. This is possibly due in part to the fact that the technology is still so new and still very expensive. We did a quick scan of some recent reviews to put together the following points for discussion.

Advantages of RNA-Seq

  • Benefits from the “digital” nature of counting sequence reads [4]
  • Theoretically no limit to the dynamic range of detection [1]
  • Not reliant on probes for targets that must be specified in advance, and therefore not dependent on prior knowledge of the organism [1,3,4]
  • Can go beyond merely quantifying gene expression: able to discover new complexities in the transcriptome
    • transcription initiation sites, cataloguing of sense and antisense transcripts, improved detection of alternative splicing events, and detection of gene fusion transcripts [1]
    • allele-specific expression, novel promoters and isoforms, fusion transcripts, RNA editing [2]
    • discovery of new alternative transcription and unannotated transcription, and measurement of transcription for non-coding regions [3]

Issues with RNA-Seq

  • Still much more expensive than arrays
  • Datasets produced are large and complex. Data analysis and interpretation are not straightforward.
  • Variability is seen in RNA-Seq datasets, similar to what was first observed in microarrays. It is mostly due to bias introduced during library prep (reverse transcription to cDNA):
    • GC content of the sequence [2,3]
    • use of random hexamer primers [2]
    • 3’ and 5’ depletion, or bias towards the 3’ end [2]
    • gene length [3]
    • bias toward specific RNA species [2,3]
    • differences in cDNA amplification efficiency [3]
    • introduction of artefacts [1]
  • Additional instrument variability: flow cell and/or lane variability
  • Not immune to standard experimental variability: personnel, reagent lots, processing date, etc.
  • Sequencing depth
    • a small number of very highly expressed genes (7%) account for most of the reads (75%), so simple correlation coefficients can be misleading [4]
    • features with low counts (fewer than a few reads) show low consistency between technical replicates and are therefore often excluded from comparisons across conditions [2]
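
To make the sequencing depth point concrete, here is a minimal sketch in Python. The counts are made up purely for illustration (they are not taken from any of the reviews), and the cutoff of 10 reads in the filter is a hypothetical value, not a recommendation. It shows how a handful of highly expressed genes can produce a near-perfect correlation between two replicates even when the low-count genes disagree completely.

```python
from math import sqrt


def pearson(xs, ys):
    """Plain Pearson correlation coefficient for two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def keep_above(rep_a, rep_b, min_count):
    """Keep only genes with at least min_count reads in both replicates."""
    pairs = [(a, b) for a, b in zip(rep_a, rep_b)
             if a >= min_count and b >= min_count]
    return [a for a, _ in pairs], [b for _, b in pairs]


# Made-up counts for ten genes in two technical replicates.
# Three genes are very highly expressed; seven have only a few reads.
rep_a = [10000, 20000, 30000, 1, 2, 3, 4, 5, 6, 7]
rep_b = [10100, 19900, 30050, 7, 6, 5, 4, 3, 2, 1]

r_all = pearson(rep_a, rep_b)          # dominated by the three big genes
r_low = pearson(rep_a[3:], rep_b[3:])  # the low-count genes on their own
print(f"all genes: r = {r_all:.4f}")
print(f"low-count genes only: r = {r_low:.4f}")

# Hypothetical cutoff of 10 reads; the right threshold depends on the experiment.
filt_a, filt_b = keep_above(rep_a, rep_b, min_count=10)
print(f"genes kept after filtering: {len(filt_a)}")
```

The overall coefficient looks excellent while the low-count genes are anti-correlated, which is why a single correlation value across all genes says little about precision at the low end.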

Things to consider for experimental design

  • RNA-Seq does not magically eliminate the need for basic, solid experimental design!
  • Most of the experimental design principles from microarray technologies still apply [2]
  • Randomization of samples is important (assignment to conditions, prep order, location in instrument, etc.)
  • Replicates are important
    • Two types: biological and technical. The difference is simple: if the goal is to evaluate the technology, use technical replicates; if the goal is to investigate the biological differences between conditions/tissues/treatments, biological replicates are essential [2].
    • The most desirable replicates are biological replicates, which are true replicates and capture the variation among biological samples [2].
    • It makes my skin crawl when I read in a publication that samples were pooled to reduce variation!
    • Eliminating biological replicates creates statistically suboptimal experiments.
  • Normalization is important, to allow RNA-Seq runs to be combined or compared
    • Commonly observed data distortions demonstrate the need for appropriate data normalization [2,3,4].
    • Calculating gene expression by simple RPKM (number of reads per kilobase of exonic sequence per million mapped reads) is not always an appropriate way to deal with variability [3].
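
For reference, the RPKM definition quoted above can be written out in a few lines of Python. The example numbers are invented, and the two-sample scenario is our own illustration of one way simple RPKM can mislead (a shift in library composition); it is not necessarily the specific problem the cited reviews address.

```python
def rpkm(read_count, exon_length_bp, total_mapped_reads):
    """Reads per kilobase of exonic sequence per million mapped reads."""
    kilobases = exon_length_bp / 1_000
    millions_mapped = total_mapped_reads / 1_000_000
    return read_count / (kilobases * millions_mapped)


# Worked example: 500 reads on 2,000 bp of exon, 10M mapped reads in the run.
print(rpkm(500, 2_000, 10_000_000))

# Made-up scenario: both samples are sequenced to the same depth (10,000 reads).
# Assume gene_A is equally abundant in both samples, but gene_B is strongly
# up-regulated in sample_2 and soaks up a larger share of the reads.
samples = {
    "sample_1": {"gene_A": 1_000, "gene_B": 9_000},
    "sample_2": {"gene_A": 100, "gene_B": 9_900},
}
gene_a_length_bp = 1_000  # hypothetical exonic length

for name, counts in samples.items():
    total = sum(counts.values())
    print(name, rpkm(counts["gene_A"], gene_a_length_bp, total))
```

Under these assumptions gene_A's RPKM drops roughly tenfold between the samples even though its abundance did not change, because RPKM only rescales by length and total reads.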

Many of the issues can be dealt with:

  • Biases of NGS can be mitigated by increasing sample size and/or sequencing depth.
    • Simple differential expression between polyA+ samples may require only 30M paired-end reads of > 30 nt [5]
    • Detecting differential expression of weakly expressed genes or low copy number transcripts/isoforms simply requires a lot of reads: 100-200M paired-end reads of 76 nt or longer [5]
  • Paired-end sequencing: at the same sequencing depth, paired-end sequencing increases the sensitivity and specificity of detection of alternative splicing and chimeras compared with single-end sequencing [2].
  • Be careful to select the right analysis methods based on your experimental design and goals. There are many, many tools out there for alignment, mapping, normalization, and quantification. The strategy you choose will have a significant impact on your results.
  • Validate your results with another method and/or other samples [2].
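
A rough back-of-the-envelope sketch of why weakly expressed transcripts need so many reads is given below. It assumes simple Poisson sampling, counts each read pair as one observation, and uses a hypothetical transcript that contributes one read in a million; none of these assumptions comes from the ENCODE guidelines themselves.

```python
from math import sqrt


def expected_reads(total_reads, transcript_fraction):
    """Expected number of reads landing on a transcript at a given depth."""
    return total_reads * transcript_fraction


def poisson_cv(expected):
    """Coefficient of variation of a Poisson count with the given mean."""
    return 1 / sqrt(expected)


# Hypothetical weakly expressed transcript: one read in a million.
fraction = 1e-6

for depth in (30_000_000, 100_000_000, 200_000_000):
    mean = expected_reads(depth, fraction)
    print(f"{depth // 1_000_000}M reads: ~{mean:.0f} expected reads, "
          f"sampling CV ~{poisson_cv(mean):.2f}")
```

Sampling noise alone shrinks only with the square root of the depth, so going from 30M to 200M reads reduces it by well under threefold. Biological variation comes on top of that and is not reduced by deeper sequencing, which is where replicates come in.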

 

  1. Ozsolak F, Milos PM. (2011) RNA sequencing: advances, challenges and opportunities. Nat Rev Genetics 12, 87-98. [abstract]
  2. Fang Z, Cui X. (2011) Design and validation issues in RNA-seq experiments. Brief Bioinform. 12(3), 280-87. [abstract]
  3. Hansen KD, Irizarry RA, Wu Z. (2011) Removing technical variability in RNA-seq data using conditional quantile normalization. Collection of Biostatistics Research Archive. [article]
  4. Labaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley HS, Kreil DP. (2011) Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics 27(13), i383-91. [abstract]
  5. The ENCODE Consortium (2011) Standards, Guidelines and Best Practices for RNA-Seq V1.0. [document]

Comments

4 Responses to “The Magic of RNA-Seq”

  1. Transcription on July 8th, 2011 3:14 am

    Theoretically no limit to the dynamic range of detection. Randomization of samples is important. thank you for more information is very well.

  2. woodylin on July 10th, 2011 11:54 pm

    “It makes my skin crawl when I read in a publication that samples were pooled to reduce variation!”
    I have ever read one publication using pooled sample of RNA-Seq to discuss splicing noise, was that one the case?

  3. RNA-Seq Blog Poll Results | RNA-Seq Blog on November 16th, 2012 5:22 pm

    [...] Given all that RNA-Seq is capable of providing for us, it is interesting to me that most of you are using RNA-Seq for gene expression analysis; something for which microarrays and PCR have proven more than adequate to provide for us for the past two decades.  However, since this is the case, we thought it would be beneficial to provide some info.  (See post below – http://rna-seqblog.com/information/the-magic-of-rna-seq/) [...]

  4. Bret on April 19th, 2013 11:02 pm

    Clearly written. In regards Biological noise as a source of variation. RNA-seq gives you greater depth than a microarray. So, does this mean that the ability to detect biological noise is increased? Is there another way besides increased biological replicates to account for this? A bit tongue-in-cheek. But, this is a relevant experimental design consideration.
