Alternative splicing (AS) is an essential post-transcriptional mechanism that regulates many biological processes. However, comprehensively identifying the different types of AS events without guidance from a reference genome remains a challenge. Researchers at Beijing Normal University have developed a novel method, MkcDBGAS, to identify all seven types of AS events using the transcriptome alone, without a reference genome. Modeled on full-length transcripts of human and Arabidopsis thaliana, MkcDBGAS consists of three modules. In the first module, MkcDBGAS uses, for the first time, a colored de Bruijn graph with dynamic and mixed k-mers to identify bubbles generated by AS, with precision higher than 98.17%, and detects AS types overlooked by other tools. In the second module, to further classify the types of AS, MkcDBGAS adds exon motifs to the feature matrix and feeds it to an XGBoost-based classifier, reaching a classification accuracy greater than 93.40% and outperforming other widely used machine learning models and state-of-the-art methods. Highly scalable, MkcDBGAS also performed well when applied to Iso-Seq data of Amborella and to the mouse transcriptome. In the third module, MkcDBGAS provides differential splicing analysis across multiple biological conditions when RNA-sequencing data are available. MkcDBGAS is the first accurate and scalable method for detecting all seven types of AS events using the transcriptome alone, which will greatly empower AS studies across a wider field.
The workflow of MkcDBGAS
(A) All-versus-all alignment. (B) Constructing cDBGs and calling other-induced bubbles. A colored de Bruijn graph (cDBG) was constructed from two sequences using k-mers. Based on their topology, bubbles fall into three categories: SNV-induced, AS-induced and other-induced. (C) Reconstructing and incorporating sub-cDBGs. For each other-induced bubble, the sequences of the two arms were extracted and step B was repeated on them with a smaller k′-mer to obtain a sub-cDBG. All the sub-cDBGs were incorporated into the original cDBG by replacing the vertices and edges of the other-induced bubbles with the corresponding sub-cDBGs. (D) Calling bubbles. Three tasks are performed: (i) counting SNV-induced bubbles, (ii) identifying alternative first exon (AF), alternative last exon (AL) and mutually exclusive exon (MX) events and (iii) identifying AS-induced bubbles. (E) Classification. AS events from human and Arabidopsis thaliana were used as training datasets to train two XGBoost-based classifiers, one per species, each covering four types of AS events. (F) Analysis of differential splicing. When RNA-seq data are available, quantitative and differential splicing analysis is further conducted across multiple biological conditions.
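To make steps B and D more concrete, the following is a minimal Python sketch of building a colored de Bruijn graph from two transcript sequences and calling bubbles in it. It is not the MkcDBGAS implementation: the function names (`kmer_path`, `build_cdbg`, `call_bubbles`, `classify`), the toy exon sequences and the arm-length rule in `classify` are all assumptions made for illustration, and the sketch assumes that no k-mer repeats within a single sequence.

```python
from collections import defaultdict


def kmer_path(seq, k):
    """Return the ordered list of k-mers along a sequence."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_cdbg(seqs, k):
    """Build a toy colored de Bruijn graph.

    colors: k-mer (vertex) -> set of indices of the sequences containing it.
    edges:  k-mer -> set of k-mers that directly follow it in some sequence.
    """
    colors = defaultdict(set)
    edges = defaultdict(set)
    for color, seq in enumerate(seqs):
        path = kmer_path(seq, k)
        for kmer in path:
            colors[kmer].add(color)
        for left, right in zip(path, path[1:]):
            edges[left].add(right)
    return colors, edges


def call_bubbles(seq_a, seq_b, k):
    """Return bubbles as (source, sink, arm_a, arm_b) tuples.

    A bubble here is a pair of shared k-mers joined by two arms made only of
    sequence-specific k-mers. Assumes k-mers are unique within each sequence.
    """
    colors, _ = build_cdbg([seq_a, seq_b], k)
    path_a, path_b = kmer_path(seq_a, k), kmer_path(seq_b, k)
    pos_b = {kmer: i for i, kmer in enumerate(path_b)}
    shared = [i for i, kmer in enumerate(path_a) if len(colors[kmer]) == 2]
    bubbles = []
    for i, j in zip(shared, shared[1:]):
        p, q = pos_b[path_a[i]], pos_b[path_a[j]]
        if q <= p:
            continue  # anchors appear in a different order: not a simple bubble
        arm_a, arm_b = path_a[i + 1:j], path_b[p + 1:q]
        private_b = all(len(colors[kmer]) == 1 for kmer in arm_b)
        if (arm_a or arm_b) and private_b:
            bubbles.append((path_a[i], path_a[j], arm_a, arm_b))
    return bubbles


def classify(arm_a, arm_b, k):
    """Toy topology rule (illustrative assumption, not the tool's criteria)."""
    if len(arm_a) == k and len(arm_b) == k:
        return "SNV-like"   # one substitution changes exactly k k-mers per arm
    if min(len(arm_a), len(arm_b)) == k - 1:
        return "AS-like"    # short arm holds only the k-1 junction k-mers
    return "other"


# Hypothetical toy isoforms: the second one skips the middle exon.
K = 5
E1, E2, E3 = "ACGTTGCA", "GGATCCTG", "TCAGGCTT"
long_iso, short_iso = E1 + E2 + E3, E1 + E3

for source, sink, arm_a, arm_b in call_bubbles(long_iso, short_iso, K):
    print(source, sink, len(arm_a), len(arm_b), classify(arm_a, arm_b, K))
# prints: TTGCA TCAGG 12 4 AS-like
```

In this toy case the long arm carries the 8-base skipped exon plus its junctions (12 k-mers) while the short arm carries only the 4 junction k-mers. Running `call_bubbles` again on the two arm sequences of an unresolved bubble with a smaller k would mirror the spirit of step C, although how MkcDBGAS chooses k′ is not specified here.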
Availability – All datasets and code used in this study are available on GitHub: https://github.com/CMB-BNU/mkcDBGAS