There is a lot of excitement and buzz surrounding RNA-Seq’s forecasted replacement of microarrays for gene expression analysis. (I wonder… who could be generating this hype?) From speaking to those interested in RNA-Seq for gene expression profiling, there seems to be something of a frenzy, along with a notion that RNA-Seq has some kind of magical power that exempts it from the normal rules of good experimental design and practice (e.g. the need for replicate samples). This is possibly due in part to the technology still being so new and still very expensive. We did a quick scan of some recent reviews to put together the following points for discussion.
Advantages of RNA-Seq
- Benefits from the “digital” nature of counting sequence reads [4]
- Theoretically no limit to the dynamic range of detection [1]
- Not reliant on probes for targets that must be specified in advance, and therefore not dependent on prior knowledge of the organism [1,3,4]
- Can go beyond merely quantifying gene expression: able to discover new complexities in the transcriptome
  - transcription initiation sites, cataloguing of sense and antisense transcripts, improved detection of alternative splicing events, and detection of gene fusion transcripts [1]
  - allele-specific expression, novel promoters and isoforms, fusion transcripts, RNA editing [2]
  - discovery of new alternative transcription and unannotated transcription, and measurement of transcription in non-coding regions [3]
Issues with RNA-Seq
- Still much more expensive than arrays
- Datasets produced are large and complex. Data analysis and interpretation are not straightforward.
- Variability is seen in RNA-Seq datasets, similar to what was first observed in microarrays. It is mostly due to bias introduced during library prep (reverse transcription to cDNA):
  - GC content of the sequence [2,3]
  - the use of random hexamer primers [2]
  - 3’ and 5’ depletion, or bias towards the 3’ end [2]
  - gene length [3]
  - bias toward specific RNA species [2,3]
  - differences in cDNA amplification efficiency [3]
  - introduction of artefacts [1]
- Additional instrument variability – flow cell and/or lane variability
- Not immune to standard experimental variability – personnel, reagent lots, processing date, etc
- Sequencing depth
  - a small number of very highly expressed genes (7%) account for most of the reads (75%), so simple correlation coefficients can be misleading [4]
  - genes with low read counts (fewer than a few reads) show low consistency between technical replicates and are therefore often excluded from comparisons across conditions [2]
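The point about correlation coefficients can be illustrated with a small sketch. The counts below are entirely synthetic and made up for illustration (they are not taken from any of the cited papers): 930 low-count genes whose counts are unrelated between two hypothetical replicates, plus 70 highly expressed genes that agree to within about 10%. The helper `pearson` and all variable names are assumptions of this example.

```python
import math
import random


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


random.seed(0)  # fixed seed so the sketch is reproducible

# Synthetic low-count genes: the two replicates are drawn independently,
# so there is no real agreement between them.
low_a = [random.randint(0, 20) for _ in range(930)]
low_b = [random.randint(0, 20) for _ in range(930)]

# Synthetic highly expressed genes: replicate B is replicate A with ~10% noise.
high_a = [random.randint(5_000, 50_000) for _ in range(70)]
high_b = [int(v * random.uniform(0.9, 1.1)) for v in high_a]

rep_a = low_a + high_a
rep_b = low_b + high_b

r_all = pearson(rep_a, rep_b)            # driven almost entirely by the 70 high genes
r_low = pearson(low_a, low_b)            # the other 93% of genes, taken alone
high_share = sum(high_a) / sum(rep_a)    # fraction of reads from the high genes

print(f"share of reads from the 7% highest genes: {high_share:.2f}")
print(f"Pearson r, all genes:       {r_all:.3f}")
print(f"Pearson r, low-count genes: {r_low:.3f}")
```

In this toy setup the overall coefficient looks excellent even though most genes show no reproducibility at all, which is why a single correlation value on raw counts says little about the bulk of the transcriptome.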
Things to consider for experimental design
- RNA-Seq does not magically eliminate the need for basic, solid experimental design!
- Most of the experimental design principles from microarray technologies still apply [2]
- Randomization of samples is important (assignment to conditions, prep order, location in instrument, etc.)
- Replicates are needed
  - There are two types, biological and technical, and the difference is simple: if the goal is to evaluate the technology, use technical replicates; if the goal is to investigate biological differences between conditions/tissues/treatments, biological replicates are essential [2].
  - The most desirable replicates are biological replicates, which are true replicates and capture the variation among biological samples [2].
  - It makes my skin crawl when I read in a publication that samples were pooled to reduce variation!
  - Eliminating biological replicates creates statistically suboptimal experiments.
- Normalization is important, to allow RNA-Seq runs to be combined or compared
  - Commonly observed data distortions demonstrate the need for appropriate data normalization [2,3,4].
  - Calculating gene expression by simple RPKM (reads per kilobase of exonic sequence per million mapped reads) is not always an appropriate way to deal with variability [3].
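For reference, the simple RPKM calculation mentioned above can be sketched as follows. This is only the plain textbook formula, not the normalization approach proposed in the cited papers; the function name `rpkm`, the gene names, and all numbers are hypothetical.

```python
def rpkm(read_count, exonic_length_bp, total_mapped_reads):
    """Reads per kilobase of exonic sequence per million mapped reads.

    RPKM = reads / (length in kb) / (total mapped reads in millions)
         = reads * 1e9 / (length in bp * total mapped reads)
    """
    if exonic_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    return read_count * 1e9 / (exonic_length_bp * total_mapped_reads)


# Hypothetical library with 10 million mapped reads.
total_mapped = 10_000_000

# Hypothetical genes: (reads mapped to the gene, exonic length in bp).
genes = {
    "geneA": (500, 2_000),
    "geneB": (500, 4_000),  # same read count, twice the length
}

values = {name: rpkm(reads, length, total_mapped)
          for name, (reads, length) in genes.items()}

for name, value in values.items():
    print(f"{name}: {value:.1f} RPKM")
```

The two hypothetical genes have identical read counts, but the longer one gets half the RPKM. The formula only rescales by gene length and library size; it does nothing about the other sources of bias listed earlier (GC content, priming, a handful of genes soaking up most of the reads), which is why it is not always sufficient.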
Many of the issues can be dealt with:
- Biases of NGS can be mitigated by increasing sample size and/or sequencing depth.
  - Simple differential expression between polyA+ samples may require only 30M paired-end reads of more than 30 nt [5]
  - Detecting differential expression of weakly expressed genes or low-copy-number transcripts/isoforms simply takes a lot of reads: 100-200M paired-end reads of 76 nt or longer [5]
- Paired-end sequencing: at the same sequencing depth, paired-end sequencing increases the sensitivity and specificity of detecting alternative splicing and chimeras compared with single-end sequencing [2].
- Be careful to select the right analysis methods based on your experimental design and goals. There are many, many tools out there for alignment, mapping, normalization, and quantification. The strategy you choose will have a significant impact on your results.
- Validate your results with another method and/or other samples [2].
References
1. Ozsolak F, Milos PM. (2011) RNA sequencing: advances, challenges and opportunities. Nat Rev Genetics 12, 87-98.
2. Fang Z, Cui X. (2011) Design and validation issues in RNA-seq experiments. Brief Bioinform. 12(3), 280-87.
3. Hansen KD, Irizarry RA, Wu Z. (2011) Removing technical variability in RNA-seq data using conditional quantile normalization. Collection of Biostatistics Research Archive.
4. Labaj PP, Leparc GG, Linggi BE, Markillie LM, Wiley HS, Kreil DP. (2011) Characterization and improvement of RNA-Seq precision in quantitative transcript expression profiling. Bioinformatics 27(13), i383-91.
5. The ENCODE Consortium. (2011) Standards, Guidelines and Best Practices for RNA-Seq V1.0.