McSplicer – a probabilistic model for estimating splice site usage from RNA-seq data

Alternative splicing removes intronic sequences from transcripts in different ways to produce different forms (isoforms) of mature mRNA. The composition of expressed transcripts and their alternative forms gives cells specific functionalities in a particular condition or developmental stage. In addition, a large fraction of human disease mutations affect splicing and lead to aberrant mRNA and protein products. Current methods that interrogate the transcriptome based on RNA-seq either suffer from short read length when trying to infer full-length transcripts, or are restricted to predefined units of alternative splicing that they quantify from local read evidence. Instead of attempting to quantify individual outcomes of the splicing process, such as local splicing events or full-length transcripts, researchers from Ludwig-Maximilians-Universität München propose to quantify alternative splicing using a simplified probabilistic model of the underlying splicing process. Their model is based on the usage of individual splice sites and can generate arbitrarily complex types of splicing patterns. In their method, McSplicer, the researchers estimate the model parameters using all read data at once, and they demonstrate in their experiments that this yields more accurate estimates than competing methods. The model can describe multiple effects of splicing mutations using a few easy-to-interpret parameters, as the researchers illustrate in an experiment on RNA-seq data from autism spectrum disorder patients.
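To make the idea of generating splicing patterns from individual splice site usages more concrete, here is a toy Python sketch. It is not McSplicer's actual model or code: the function `sample_exons`, the site coordinates, and the usage values are all hypothetical, and the sketch simply assumes that each exon start or end site is used independently with its own probability while walking along the gene.

```python
import random

def sample_exons(sites, usage, rng):
    """Sample one splicing pattern from per-site usage probabilities.

    sites: list of (position, kind) sorted by position, where kind is
           "start" (exon start) or "end" (exon end).
    usage: dict mapping position -> probability that the site is used.
    Returns the list of (exon_start, exon_end) pairs that were generated.
    """
    exons = []
    open_start = None  # start of the exon currently open, if any
    for pos, kind in sites:
        used = rng.random() < usage[pos]
        if kind == "start" and open_start is None and used:
            open_start = pos                  # enter an exon
        elif kind == "end" and open_start is not None and used:
            exons.append((open_start, pos))   # close the current exon
            open_start = None
    return exons  # an exon still open at the end is discarded in this toy


# Hypothetical three-exon gene; the middle exon's start site is used
# only half of the time (assumed values, for illustration only).
sites = [(100, "start"), (200, "end"), (300, "start"),
         (400, "end"), (500, "start"), (600, "end")]
usage = {100: 1.0, 200: 1.0, 300: 0.5, 400: 1.0, 500: 1.0, 600: 1.0}

rng = random.Random(0)
patterns = [tuple(sample_exons(sites, usage, rng)) for _ in range(1000)]
```

In this sketch, lowering the single usage value at position 300 is enough to shift the sampled patterns from exon inclusion toward exon skipping, which gives a feel for how a handful of site-level parameters can describe a splicing change.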

McSplicer workflow summary

The main steps of the McSplicer analysis are: A) Map RNA-seq reads to the reference genome sequence. B) Identify annotated as well as novel splice sites through the reference-based assembly of transcripts using, e.g., StringTie (Pertea et al., 2015). C) Divide the gene into non-overlapping segments bounded by splice sites, the transcription start site (TSS), and the transcription end site (TES), and count the number of reads mapping to distinct combinations of segments. In this example, only the start of the first exon and the end of the last exon are bounded by the TSS and TES, respectively; the remaining exon start and end sites correspond to splice sites. D) Estimate splice site usages using McSplicer. E) Leverage splice site usages in various kinds of downstream analyses, such as the quantification of different types of alternative splicing events.
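Step C can be illustrated with a minimal Python sketch. This is not McSplicer's implementation: the helper names `build_segments`, `segments_hit`, and `count_segment_combinations`, the coordinates, and the read representation (a list of aligned blocks as half-open `(start, end)` intervals) are assumptions made for the example.

```python
from collections import Counter

def build_segments(boundaries):
    """Split a gene into consecutive, non-overlapping segments.

    boundaries: iterable of coordinates (TSS, TES, and splice sites).
    Returns half-open (start, end) intervals between sorted boundaries.
    """
    pts = sorted(set(boundaries))
    return list(zip(pts[:-1], pts[1:]))

def segments_hit(read_blocks, segments):
    """Return the indices of all segments overlapped by a read.

    read_blocks: list of half-open (start, end) aligned blocks; a spliced
    read has more than one block.
    """
    return tuple(
        i for i, (s, e) in enumerate(segments)
        if any(bs < e and be > s for bs, be in read_blocks)
    )

def count_segment_combinations(reads, segments):
    """Count how many reads map to each distinct combination of segments."""
    return Counter(segments_hit(blocks, segments) for blocks in reads)


# Hypothetical gene: TSS at 100, TES at 600, splice sites in between.
segments = build_segments([100, 200, 300, 400, 500, 600])

# Hypothetical reads, each given as its aligned blocks.
reads = [
    [(150, 200), (300, 350)],   # spliced: segment 0 joined to segment 2
    [(120, 180)],               # unspliced, inside segment 0
    [(350, 400), (500, 540)],   # spliced: segment 2 joined to segment 4
    [(160, 200), (300, 340)],   # spliced: segment 0 joined to segment 2
]
counts = count_segment_combinations(reads, segments)
```

Counting reads per combination of segments, rather than per exon or per junction, keeps the information about which segments were observed together in the same read.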

Availability – McSplicer is implemented in Python and available as open-source at https://github.com/canzarlab/McSplicer.

Alqassem I, Sonthalia Y, Klitzke-Feser E, Shim H, Canzar S. (2021) McSplicer: a probabilistic model for estimating splice site usage from RNA-seq data. bioRxiv [online preprint].
