The advent of next-generation RNA sequencing (RNA-seq) has greatly advanced transcriptomic studies, including system-wide identification and quantification of mRNA isoforms under various biological conditions. A number of computational methods have been developed to identify mRNA isoforms from RNA-seq data in a high-throughput manner. However, a common drawback of these methods is that the isoforms they identify include a high percentage of false positives, especially for genes with complex splicing structures, such as genes with many exons and exon junctions.
Researchers from the University of California, Berkeley have developed a preselection method called “Non-negative Matrix Factorization Preselection” (NMFP), designed to improve the accuracy of computational methods in identifying mRNA isoforms from RNA-seq data. Through simulation and real data studies, they demonstrated that NMFP can effectively shrink the search space of isoform candidates and increase the accuracy of two mainstream computational methods, Cufflinks and SLIDE, in identifying mRNA isoforms.
Figure: (a) Diagram of the NMFP method. (b) Illustration of the NMF approach in NMFP.
NMFP is a useful tool for preselecting mRNA isoform candidates for downstream isoform discovery methods. It can greatly reduce the number of isoform candidates while maintaining good coverage of the unknown true isoforms. With NMFP added as an upstream step, computational methods are expected to achieve better accuracy in identifying mRNA isoforms from RNA-seq data.
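To give a feel for the factorization that NMFP builds on, here is a minimal sketch of generic non-negative matrix factorization (NMF) in Python with NumPy. It is not the authors' NMFP implementation: the function name `nmf`, the toy exon-by-sample matrix, the two made-up isoforms, and the choice of Lee-Seung multiplicative updates are all assumptions made for illustration. The idea is that a non-negative matrix V is approximated by the product of two smaller non-negative matrices, W and H.

```python
import numpy as np

def nmf(V, rank, n_iter=1000, eps=1e-9, seed=0):
    """Approximate a non-negative matrix V (m x n) as W (m x rank) @ H (rank x n)
    using Lee-Seung multiplicative updates for the squared Frobenius loss."""
    rng = np.random.default_rng(seed)
    m, n = V.shape
    # Strictly positive random start; multiplicative updates keep entries non-negative.
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: rows are 4 exons, columns are 6 samples.
# Two made-up isoforms: one uses exons 1, 2, 4 and the other uses exons 1, 3, 4.
isoforms = np.array([[1, 1], [1, 0], [0, 1], [1, 1]], dtype=float)
abundance = np.random.default_rng(1).random((2, 6)) * 10
V = isoforms @ abundance

W, H = nmf(V, rank=2)
rel_error = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
print("relative reconstruction error:", rel_error)
```

In this toy setting, the columns of W can be read as exon-usage patterns and the rows of H as their per-sample weights. NMF factors are generally not unique, so the recovered patterns need not match the made-up isoforms exactly; how NMFP turns such factors into isoform candidates is described in the original paper, not here.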
Isoform discovery performance at the nucleotide level.
(a) Precision rates, (b) recall rates, and (c) F scores of the isoforms identified by seven methods (Cufflinks, NMFP + Cufflinks, SLIDE(fewer), NMFP + SLIDE(fewer), SLIDE(more), NMFP + SLIDE(more), and NMFP) on 50 simulated RNA-seq data sets of D. melanogaster.
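For readers comparing the three panels, the relationship between them can be written out as follows, assuming the F score here is the standard harmonic mean of precision and recall (this summary does not spell out the definition, and the symbols P, R, and F are introduced only for illustration):

```latex
F = \frac{2 \, P \, R}{P + R}
```

Under that assumption, a method scores a high F only when precision P and recall R are both high, so F penalizes methods that gain recall by reporting many false-positive isoforms.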
Availability: The NMFP source code and examples are available at http://www.stat.ucla.edu/~jingyi.li/packages/NMFP.zip.