IsoEM – a novel algorithm for inference of isoform- and gene-specific expression levels from RNA-Seq data

IsoEM is a novel expectation-maximization algorithm for inferring isoform- and gene-specific expression levels from high-throughput transcriptome sequencing (RNA-Seq) data. The algorithm exploits the read disambiguation information provided by the distribution of insert sizes generated during sequencing library preparation, and takes advantage of base quality scores, strand information, and read pairing information when these are available.
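To make the expectation-maximization idea concrete, here is a minimal Python sketch of a generic EM loop for isoform frequency estimation. It is not the IsoEM implementation: the function name `em_isoform_frequencies`, the input layout, and the per-read compatibility weights are assumptions made for illustration. In a real tool, such weights would be derived from the insert size distribution, base quality scores, and strand and pairing information; here they are simply given as numbers.

```python
def em_isoform_frequencies(read_weights, lengths, n_iter=200, tol=1e-9):
    """Estimate relative isoform frequencies with a simple EM loop.

    read_weights: list of dicts, one per read, mapping isoform name to a
                  non-negative compatibility weight (hypothetical input;
                  isoforms a read cannot come from are omitted).
    lengths:      dict mapping isoform name to its (effective) length.
    Returns a dict mapping isoform name to its estimated frequency.
    """
    isoforms = list(lengths)
    # Start from a uniform frequency estimate.
    freq = {i: 1.0 / len(isoforms) for i in isoforms}

    for _ in range(n_iter):
        # E-step: split each read fractionally among compatible isoforms,
        # in proportion to (current frequency) x (compatibility weight).
        counts = {i: 0.0 for i in isoforms}
        for weights in read_weights:
            total = sum(freq[i] * w for i, w in weights.items())
            if total == 0.0:
                continue  # read is incompatible with every expressed isoform
            for i, w in weights.items():
                counts[i] += freq[i] * w / total

        # M-step: convert expected read counts to frequencies by
        # normalizing for isoform length, then rescaling to sum to 1.
        density = {i: counts[i] / lengths[i] for i in isoforms}
        norm = sum(density.values())
        if norm == 0.0:
            break  # no read could be assigned; keep the current estimate
        new_freq = {i: density[i] / norm for i in isoforms}

        # Stop when the estimate no longer changes appreciably.
        delta = max(abs(new_freq[i] - freq[i]) for i in isoforms)
        freq = new_freq
        if delta < tol:
            break

    return freq


if __name__ == "__main__":
    # Toy data: 10 reads unique to isoform A, 5 unique to B,
    # and 10 reads equally compatible with both.
    reads = [{"A": 1.0}] * 10 + [{"B": 1.0}] * 5 + [{"A": 1.0, "B": 1.0}] * 10
    print(em_isoform_frequencies(reads, {"A": 1000.0, "B": 1000.0}))
```

In this sketch the ambiguous reads are reassigned at every iteration according to the current frequency estimates, which is the core of the EM approach; better-informed weights would sharpen that assignment.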

The open source Java implementation of IsoEM is freely available at http://dna.engr.uconn.edu/software/IsoEM/.

Massively parallel whole transcriptome sequencing, commonly referred to as RNA-Seq, is quickly becoming the technology of choice for gene expression profiling. However, due to the short read lengths delivered by current sequencing technologies, estimating expression levels for alternatively spliced gene isoforms remains challenging.

Experiments on both synthetic and real RNA-Seq datasets show that IsoEM has scalable running time and outperforms existing methods of isoform and gene expression level estimation.

Simulation experiments confirm previous findings that, for a fixed sequencing cost, using reads longer than 25-36 bases does not necessarily lead to better accuracy for estimating expression levels of annotated isoforms and genes.

Nicolae M, Mangul S, Mandoiu II, Zelikovsky A. (2011) Estimation of alternative splicing isoform frequencies from RNA-Seq data. Algorithms Mol Biol 6(1), 9.