Improving RNA-Seq expression estimation by modeling isoform- and exon-specific read sequencing rate

RNA-Seq, a high-throughput sequencing technology, has been widely used in recent years to quantify gene and isoform expression in transcriptome studies. Accurate expression measurement from the millions or billions of short reads generated is obstructed by two difficulties.

One is the ambiguous mapping of reads to the reference transcriptome caused by alternative splicing. This increases the uncertainty in estimating isoform expression.

The other is the non-uniformity of read distribution along the reference transcriptome due to positional, sequencing, mappability and other undiscovered sources of bias. This violates the assumption of uniformly distributed reads made by many expression calculation approaches, such as the direct RPKM calculation and Poisson-based models.
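To make the uniformity assumption concrete, here is a minimal sketch of the standard RPKM formula (reads per kilobase of transcript per million mapped reads). It is not code from NLDMseq; the function name `rpkm` and the numbers in the usage line are illustrative assumptions. The formula divides a read count by feature length, which is only a fair normalization if reads fall uniformly along the feature.

```python
def rpkm(read_count, length_bp, total_mapped_reads):
    """Reads per kilobase per million mapped reads.

    Hypothetical helper for illustration only. It implicitly assumes
    reads are spread uniformly over the feature, which is exactly the
    assumption that positional and sequence biases break.
    """
    if length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("length and total mapped reads must be positive")
    # 1e9 = 1e3 (bases per kilobase) * 1e6 (reads per million)
    return read_count * 1e9 / (length_bp * total_mapped_reads)


# Illustrative numbers (not from the paper): 500 reads on a 2,000 bp
# transcript in a library of 10 million mapped reads.
example_value = rpkm(500, 2000, 10_000_000)
print(example_value)  # 25.0
```

If a bias concentrates reads in one region of a transcript, the count in the numerator no longer scales with length in the way this normalization expects, which is one motivation for modeling read sequencing rates explicitly.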

Many methods have been proposed to address these difficulties. Some approaches employ latent variable models to discover the underlying pattern of read sequencing.

However, most of these methods correct bias based on the surrounding sequence content and share a single bias model across all genes. They therefore cannot estimate the gene- and isoform-specific biases revealed by recent studies.

Researchers at Nanjing University of Aeronautics and Astronautics have developed a latent variable model, NLDMseq, to estimate gene and isoform expression.

This method adopts latent variables to model the unknown isoforms from which reads originate and the underlying proportions of the multiple spliced variants. The isoform- and exon-specific read sequencing biases are modeled to account for the non-uniformity of read distribution, and are identified by utilizing the replicate information from multiple lanes of a single library run.

Graphical model representation of NLDMseq. The white circles represent latent variables, the large black circle represents the observed exon, and the small black circles represent hyperparameters. The plates denote replication of the random variables.

The researchers employ simulated and real data to verify the performance of their method in terms of accuracy in the calculation of gene and isoform expression. Results show that NLDMseq obtains competitive gene and isoform expression estimates compared to popular alternatives.

Availability – The method has been implemented as freely available software, which can be found at https://github.com/PUGEA/NLDMseq

Liu X, Shi X, Chen C, Zhang L. (2015) Improving RNA-Seq expression estimation by modeling isoform- and exon-specific read sequencing rate. BMC Bioinformatics 16:332. [article]
