Next-generation RNA sequencing (RNA-seq) technologies are widely applied in transcriptome profiling, facilitating studies of gene structure and expression on a genome-wide scale. Aligning reads to the reference genome and calling splicing junctions is an important step for downstream analyses such as alternative splicing analysis and isoform construction. However, because of introns, RNA-seq reads that span splicing sites cannot be fully mapped when aligned to the reference genome. It is therefore challenging to align reads and call splicing junctions accurately.
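To illustrate the mapping problem described above, here is a minimal Python sketch on a made-up toy genome. The sequences, the `split_map` helper, and its `min_anchor` parameter are illustrative assumptions; they are not part of SpliceJumper or of any real aligner.

```python
# Toy sequences invented for illustration only.
EXON1 = "ATGGCCATTGTA"
INTRON = "GTAAGTCCCTTTTCAG"
EXON2 = "CGGTACCTGAAA"

genome = EXON1 + INTRON + EXON2   # the reference still contains the intron
transcript = EXON1 + EXON2        # the spliced mRNA does not

# A 12-base read that straddles the exon-exon boundary of the transcript.
read = transcript[6:18]


def split_map(read, genome, min_anchor=4):
    """Naive split alignment (hypothetical helper).

    Tries every split point, places both parts by exact matching, and
    requires a gap between them. Returns (intron_start, intron_end) as a
    0-based half-open interval on the genome, or None if no split fits.
    """
    for k in range(min_anchor, len(read) - min_anchor + 1):
        left_pos = genome.find(read[:k])
        if left_pos == -1:
            continue
        left_end = left_pos + k
        # The right part must start strictly after the left part ends.
        right_pos = genome.find(read[k:], left_end + 1)
        if right_pos != -1:
            return (left_end, right_pos)
    return None


contiguous_hit = genome.find(read)   # -1 means no contiguous match
junction = split_map(read, genome)   # candidate intron interval
print(contiguous_hit, junction)
```

In this toy case the read has no contiguous match, but splitting it recovers the intron boundaries. Real data adds mismatches, repeats, and ambiguous split points, which is part of what makes junction calling hard.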
In this paper, researchers from the University of Connecticut present a classification-based approach for calling splicing junctions from RNA-seq data, implemented in the program SpliceJumper. SpliceJumper uses machine learning to combine multiple features extracted from RNA-seq data. The authors compare SpliceJumper with two existing RNA-seq analysis approaches, TopHat2 and MapSplice2, on both simulated and real data. Their results show that SpliceJumper outperforms TopHat2 and MapSplice2 in accuracy.
Sequence reads from 3,032,644 to 3,036,223 on chromosome 11 of the simulated Test1 dataset. A benchmarked splicing junction from 3,033,210 to 3,035,657 (connected by dashed lines) is called by SpliceJumper but missed by both TopHat2 and MapSplice2. The numbers 1-9 mark nine splicing sites. A, B, and C are three split-mapped reads (only the mapped part is shown) that are clipped at splicing site 8; their clipped segments can be aligned at site 1. D is clipped at splicing site 1, and its clipped segment can be mapped at site 8. A, B, C, and D, together with their mate reads, form four discordant pairs encompassing splicing sites 1 and 8.
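The evidence described in this caption can be tallied per candidate junction. The following Python sketch does so under stated assumptions: the `ClippedRead` record and the `junction_evidence` function are hypothetical, sites are referred to by their figure labels (1 and 8) rather than genomic coordinates, and the two counts are only examples of the kind of features a classifier could combine, not SpliceJumper's actual feature set.

```python
from collections import Counter
from typing import NamedTuple


class ClippedRead(NamedTuple):
    """Hypothetical record for one split-mapped read."""
    name: str
    clip_site: int     # splicing site where the read is clipped
    remap_site: int    # site where the clipped segment re-aligns
    discordant: bool   # True if the read and its mate form a discordant pair


# Reads A-D as described in the caption.
reads = [
    ClippedRead("A", 8, 1, True),
    ClippedRead("B", 8, 1, True),
    ClippedRead("C", 8, 1, True),
    ClippedRead("D", 1, 8, True),
]


def junction_evidence(reads):
    """Count split reads and discordant pairs per candidate junction.

    A junction is keyed by its two sites in sorted order, so a read clipped
    at site 8 and one clipped at site 1 support the same junction.
    """
    split_reads = Counter()
    discordant_pairs = Counter()
    for r in reads:
        key = (min(r.clip_site, r.remap_site), max(r.clip_site, r.remap_site))
        split_reads[key] += 1
        if r.discordant:
            discordant_pairs[key] += 1
    return {
        key: {
            "split_reads": split_reads[key],
            "discordant_pairs": discordant_pairs[key],
        }
        for key in split_reads
    }


evidence = junction_evidence(reads)
print(evidence[(1, 8)])
```

Counts like these could form one row of a feature table, with one row per candidate junction, that a classifier then labels as a true or false junction.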
Availability – The program SpliceJumper can be downloaded at https://github.com/Reedwarbler/SpliceJumper.
Chu C, Li X, Wu Y. (2015) SpliceJumper: a classification-based approach for calling splicing junctions from RNA-seq data. BMC Bioinformatics 16(Suppl 17):S10.