from Getting Genetics Done by Stephen Turner
RNA-seq has been performed routinely for at least 5+ years, yet there is no consensus on the best methodology for analyzing this data. For example, Eduardo Eyras’s group recently posted a pre-print on methods to study splicing from RNA-seq, where this great figure was shown:
This illustrates the problem clearly: ignoring alternative workflows that simply count reads mapping to genes or exons and doing a negative binomial-based test, there are thousands of potential paths through an isoform-resolution RNA-seq analysis.
While the gene/exon-count based approach is simpler and arguably more powerful and well-characterized, there are numerous potential problems with this approach, as outlined in the recent cuffdiff 2 paper:
|Reproduced from the cuffdiff2 paper under fair use.|
I’m not going to go into the merits of feature-count methods versus transcript deconvolution methods – that discussion is best settled by others.
But if you have the coverage needed for isoform reconstruction, perhaps the most commonly-trodden path through the transcript-resolution differential expression analysis is using TopHat for alignment, Cufflinks for assembly, and Cuffdiff for quantitation and differential expression analysis, as described in this recent protocol.
However, if you’ve ever used cufflinks with lots of samples or lots of reads, you’ve noted the exponential increase in computational resources necessary to run the analysis. This problem, as well as the performance of an alternative approach, is illustrated in Fig. 2b in the recent publication about a new tool from Lior Pachter’s lab, the un-Google-ably named eXpress, for streaming real-time fragment assignment of RNA-seq data:
|Reproduced from the eXpress paper under Fair Use.|
I won’t attempt to explain here how eXpress works other than to tell you it’s EM-based, and to direct you to the paper for a fuller explanation. As you can see from the figure above, the resource requirement with more samples and higher coverage increases linearly, consuming only slightly more RAM than the UNIX wc (word count) command, all the while maintaining accuracy comparable to or slightly better than existing state-of-the-art methods like Cufflinks or RSEM.
So, what’s the hold-up? Why isn’t everyone using eXpress for differential gene/transcript expression with RNA-seq data? Personal preferences and allegiances aside, part of the reason might be because eXpress’s estimated fragment abundances are not true counts, and cannot be statistically treated as such. That is, you can’t simply take the transcript abundances you get out of eXpress and throw them into a count-based test like those implemented in edgeR or DESeq, and expect those results to be robust and accurate. What’s lacking is a mathematical framework (and a user-friendly R or other software package) for conducting statistical analysis of differential abundance across multiple samples of transcript abundances as estimated by eXpress.
I ran into Lior at the CSHL Genome Informatics meeting this morning, and pressed him on when we might see an R package for statistically analyzing eXpress-estimated isoform abundances, and I was told we would see something within a month. I’m going to hold you to that, Lior, so keep an eye out on Bioconductor and Lior’s blog for the much needed and long-awaited statistical framework and R package to do this analysis.
#GI2013 folks, I’ll see you at the poster session and reception. And to everyone else, as always, keep calm and sequence on.