RNA-Seq is a powerful new technology for comprehensively analyzing the transcriptome of any given cell population. An important task in RNA-Seq data analysis is quantifying the expression levels of all transcripts. Although many methods have been introduced and much progress has been made, a satisfactory solution remains elusive.
In this article, researchers from Peking University borrow an idea from the Positional Dependent Nearest Neighborhood (PDNN) model, originally developed for analyzing microarray data, to model the non-uniformity of read distribution in RNA-Seq data. They propose a robust nonlinear regression model, the Positional Dependent Energy Guided Expression Model (PDEGEM), to estimate the abundance of transcripts. They find that PDEGEM fits the data better than mseq in all three real datasets they tested. The researchers also find that the expression measures obtained using PDEGEM show higher correlation with those obtained from alternative assays for quantifying gene and isoform expression.
The stacking energy of PDEGEM in 8 different samples of Dataset 1. The x-axis represents the 16 dinucleotides AA, AC, …, TT, while the y-axis indicates the stacking energies of the dinucleotides. Lines with different colors indicate different datasets: w1, w2 and w3 represent the Wold data, b1, b2 and b3 the Burge data, and g1 and g2 the Grimmond data.
Based on these results, the researchers believe that PDEGEM can improve the accuracy of modeling and estimating transcript abundance and isoform expression in RNA-Seq data. Additionally, although the stacking energies and positional weights of PDEGEM depend to some extent on the sequencing platform and species, they share some common trends, which suggests that PDEGEM may partly reflect the mechanism of DNA binding between the template strand and the newly synthesized read.
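To make the roles of the two parameter sets concrete, here is a minimal Python sketch of a PDNN-style energy term: a weighted sum, over positions in a sequence window, of the stacking energy of each adjacent dinucleotide. It is only an illustration under stated assumptions, not the published PDEGEM formula. The function names, the window length, the sign convention, and the exponential link between energy and expected read count are all hypothetical choices made for this example.

```python
import math
from itertools import product

BASES = "ACGT"
# The 16 dinucleotides in the order AA, AC, ..., TT.
DINUCLEOTIDES = ["".join(pair) for pair in product(BASES, repeat=2)]


def window_energy(seq, stacking_energy, positional_weight):
    """Weighted sum of dinucleotide stacking energies along a window.

    Assumption: a PDNN-style form, sum_k w[k] * eps(seq[k], seq[k+1]).
    A window of n bases has n - 1 adjacent dinucleotides, so it needs
    n - 1 positional weights.
    """
    if len(seq) != len(positional_weight) + 1:
        raise ValueError("need one weight per adjacent dinucleotide")
    return sum(
        weight * stacking_energy[seq[k:k + 2]]
        for k, weight in enumerate(positional_weight)
    )


def expected_count(abundance, seq, stacking_energy, positional_weight):
    """Hypothetical link: expected reads = abundance * exp(energy).

    The exponential link and its sign are assumptions for illustration.
    """
    energy = window_energy(seq, stacking_energy, positional_weight)
    return abundance * math.exp(energy)


if __name__ == "__main__":
    # Made-up parameters: every dinucleotide gets energy 0 except two.
    energies = {dn: 0.0 for dn in DINUCLEOTIDES}
    energies["AC"] = 0.2
    energies["CG"] = -0.4
    weights = [1.0, 0.5]  # one weight per dinucleotide in a 3-base window
    print(expected_count(10.0, "ACG", energies, weights))
```

In a form like this, the 16 stacking energies are the quantities plotted per dinucleotide in the figure, and the positional weights control how strongly each position in the window contributes. In practice both would be estimated from read counts by regression rather than set by hand.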
Availability – The PDEGEM model can be freely downloaded at: http://www.math.pku.edu.cn/teachers/dengmh/PDEGEM/