RNA-Seq is a powerful technology for transcriptome analysis that is predicted to replace microarrays. Using second-generation sequencing technology, millions of (relatively) short reads are sequenced from RNA samples. By analyzing these reads, more accurate estimates of both gene and isoform expression levels can be obtained. However, several computational challenges must be overcome before such estimates can be obtained. One critical challenge is how to deal with reads that map to multiple locations (multireads).
We propose a generative probabilistic model of the sequencing process to handle this challenge. The corresponding algorithm, RSEM (RNA-Seq by Expectation Maximization), is the first algorithm that handles both gene-level and isoform-level multireads in a statistically well-founded way. Our simulation results show that RSEM has superior or comparable quantification accuracy relative to other currently available methods.
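To make the idea of handling multireads by expectation maximization concrete, the following is a minimal sketch, not the RSEM implementation. It assumes a deliberately simplified model in which a read is generated by choosing a transcript with probability theta and then a start position uniformly along it, so each candidate transcript is weighted by theta divided by its length. The function name `em_multiread_abundance`, its parameters, and the toy data are hypothetical and chosen only for illustration.

```python
def em_multiread_abundance(alignments, lengths, n_iter=200):
    """Estimate the fraction of reads originating from each transcript.

    alignments: list of lists; each inner list holds the transcript ids
                a read maps to (more than one id means a multiread).
    lengths:    dict mapping transcript id -> transcript length.
    Simplifying assumption (not RSEM's full model): P(read | t) = 1 / length[t].
    """
    transcripts = sorted(lengths)
    # Start from a uniform guess over transcripts.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}

    for _ in range(n_iter):
        # E-step: split each read fractionally among its candidate transcripts.
        counts = {t: 0.0 for t in transcripts}
        for candidates in alignments:
            weights = [theta[t] / lengths[t] for t in candidates]
            total = sum(weights)
            if total == 0.0:
                continue  # read with no usable candidate; skip it
            for t, w in zip(candidates, weights):
                counts[t] += w / total

        # M-step: re-estimate theta from the expected read counts.
        n = sum(counts.values())
        if n == 0.0:
            break  # no assignable reads; keep the current estimate
        theta = {t: counts[t] / n for t in transcripts}

    return theta


# Toy data: 6 reads unique to A, 2 unique to B, 4 multireads mapping to both.
toy_alignments = [["A"]] * 6 + [["B"]] * 2 + [["A", "B"]] * 4
toy_lengths = {"A": 1000, "B": 1000}
toy_theta = em_multiread_abundance(toy_alignments, toy_lengths)
```

In this toy case the multireads are allocated in proportion to the current abundance estimates rather than discarded or split evenly, so the estimate for A settles near 0.75 and the estimate for B near 0.25.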
Using RSEM, we evaluate whether, given a fixed sequencing throughput, longer reads and paired-end reads provide better accuracy than short reads and single-end reads. The simulation results suggest that short, single-end reads are in fact better for a fixed throughput, contrary to the common belief in the community. We also find that quality scores provide little additional information for improving quantification accuracy. Our findings have the potential to guide RNA-Seq experimental design and technology development.
The RSEM package is publicly available at http://deweylab.biostat.wisc.edu/rsem.