Cheaper RNA-seq: but you might have to give up strandeness and splicing

from CoreGenomics by James Hadfield

I’m going to start this post by saying something that may be controversial “splicing analysis is a niche, strandedness is useful in a minority of cases and that most RNA-seq users are only interested in getting Affy-like 3′ biased differential gene expression data”.

I’m constantly looking at how we perform RNA-seq analysis for differential gene expression (DGE) analysis with the major goal of making it cheap enough that replication levels increase from the commonly used 3 replicates to a more powerful 6+ replicates. Obviously users are put off doubling the cost of their experiment when everyone’s been running triplicate as the maximum level of replication for the last ten years or more, so ideally we’ll keep experimental costs the same while trying to improve the quality.

Now I do believe RNA-seq has made a dramatic difference to the cost of performing DGE experiments and is way cheaper than Affy arrays used to be. But a side-effect of moving over to RNA-seq is that most scientists are very interested in science and read widely so when they see papers exclaiming all the possibilities RNA-seq brings they’d like to get as much as possible from their own RNA-seq experiments.

What can you do with RNA-seq: i) differential gene expression in an (almost) unbiased manner, ii) analyse splicing isoforms using splice-junction reads, iii) identify the full extent of the transcriptome including non-coding, antisense and over-lapping transcripts often making use of strand information.

The challenge with the first statement is that there is always bias experiments, just because RNA-seq does not suffer from the problems microarrays has/had; of sensitivity, reliance on an annotated transcriptome, limitations of design/use of oligonucleotide probes, etc, does not mean RNA-seq isn’t biased. The choice of time-point of cell-type in any experiment adds an immediate, although hopefully informed, bias, RNA extraction methodology will have an impact and RNA-seq library preparation, and sequencing will add other layers of bias on top.

Statement two gets many biologists salivating. We’ve all read some fantastic papers showing the biological importance of splicing and would like to be able to show similar results in the systems we’re studying. However splicing analysis is limited by the fact that today we can only sequence through a single splice junction in each read meaning we need to infer which isoforms might be present, and if there is differential isoform usage. The competing methods of analysis work in quite different ways: cufflinks use of the Bowtie+TopHat spliced alignment versus DEX-seq’s analysis of differential exon usage by couting reads in individual exons.

Statement three has been massively impacted by the introduction of stranded RNA-seq protocols. Methods that preserve the information about which strand the transcript came from have allowed an even more detailed view of the transcriptome than we have ever had especially in the cases of transcripts that overlap in genome location and the extent and biological importance of antisense transcription.

What does this have to do with making RNA-seq cheaper: Unfortunately ii) and iii) add significant complexity to the design, cost and analysis of RNA-seq experiments. The impact of splicing analysis is relatively simple to explain to users, we need many more reads and this increases the cost of the experiment in a linear fashion and users can decide if the splicing analysis is “worth” it. But the impact of getting strand information is less easy to explain (certainly for me in mammalian systems), very few of us are strand-savvy and we’re most likely to ignore the stand information other than making nicer figures for presentations to show how well the method worked. Strand information does not come free-of-charge, the most popular method for stranded RNA-seq uses dUTP and uracil-DNA glycosylase (UDG) to remove the second-strand cDNA after ligation of sequencing adapters. Strand-preserving methods cost more to manufacture than non-stranded methods and hence cost more for end users.

The cost of RNA-seq in 2014: I recommend 10-20M single-end 50bp reads per sample for DGE studies and that allows 10 samples per lane on a HiSeq 2500/2000. At about £600 per lane this makes sequencing £60 per sample. Library prep reagents and labour are about £35 each, making RNA-seq one of the few NGS applications where sequencing costs are lower than sample prep.

Ideally we’d maintain a 10 fold difference between sequencing and library prep, but this would mean bringing prep down to under £15 or $10 per sample. Is it possible to squeeze oligo-dT mRNA enrichment, cDNA synthesis and an Illumina prep into this budget? I’m not sure this is possible outside of massive labs that can afford to make their own kits and buy enzymes in bulk.

We’re thinking about how we might approach RNA-seq prep to keep costs as low as possible. Home-brew is certainly going to get us some way to $10 per prep, but some very creative thinking may be necessary. Would you give up strandedness for an RNA-seq kit that gave perfectly good DGE data but cost half as much as the leading stranded protocol?

Would it make much difference to your next experiment: not if you once you think carefully about your experiment you can clearly state that splicing and strandedness are not going to be primary areas for analysis. I do think most scientists don’t have these uppermost in their minds when embarking on RNA-seq experiments and a quick search in PubMed might back me up: there are 3148 hits in PubMed for RNA+Seq or 2359 hits for “RNA-seq”, but only 3% (73) come up with the terms RNA+seq+strand+specific, so strandedness is not yet massive, which was a surprise to me given the reception Josh Levin’s 2010 Nat Methods paper: Comprehensive comparative analysis of strandspecific RNA sequencing methods got.

So here’s my plea to RNA-seq users: think about what you really want to get from your RNA-seq experiments and only aim for “expensive” splicing experiments with 20-50M reads per sample if you really need to, think about whether you’d give up other features to get faster, cheaper RNA-seq experiments; and save the money and effort for the experiment that needs it.

Source: CoreGenomics