The vast majority of human multiexon genes undergo alternative splicing and produce a variety of splice variant transcripts and proteins, which can perform different functions. These protein-coding splice variants (PCSVs) greatly increase the functional diversity of proteins. Most functional annotation algorithms have been developed at the gene level, and the lack of isoform-level gold standards is an important limitation for currently available machine learning algorithms. The accumulation of a large amount of RNA-seq data in the public domain greatly increases our ability to examine the functional annotation of genes at the isoform level.
University of Michigan researchers have used a multiple instance learning (MIL)-based approach to predict the function of PCSVs. They combined transcript-level expression values with gene-level functional associations from the Gene Ontology (GO) database, and evaluated support vector machine (SVM) models with 5-fold cross-validation. Performance was better for genes with multiple PCSVs than for single-PCSV genes, and it also improved when more examples were available to train the models. The researchers supported their predictions with literature evidence for the ADAM15, LMNA/C, and DMXL2 genes.
Overview of data preprocessing for predicting protein-coding splice variants
The researchers collected RNA-seq data from the ENCODE project and estimated the expression values using standard tools and thresholds. These values were used as input features for developing SVM models for different GO terms using the multiple instance learning approach.
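The multiple instance learning setup can be illustrated with a minimal sketch: each gene is a "bag" of isoforms, only the gene carries a GO label, and training alternates between fitting a classifier and choosing the isoform in each positive gene that best explains the label. This is an illustration under stated assumptions, not the researchers' implementation. The gene and isoform names, the two-feature expression vectors, and the hyperparameters are all hypothetical, the witness-selection loop is one common MIL heuristic, and a small pure-Python linear SVM (subgradient descent on the hinge loss) stands in for whatever SVM software was actually used.

```python
from typing import Dict, List, Tuple

Bags = Dict[str, Dict[str, List[float]]]  # gene -> isoform -> expression features
Model = Tuple[List[float], float]         # (weights, bias)


def score(model: Model, x: List[float]) -> float:
    """Linear decision value for one isoform's feature vector."""
    w, b = model
    return sum(wi * xi for wi, xi in zip(w, x)) + b


def train_linear_svm(X, y, lam=0.01, lr=0.1, epochs=200) -> Model:
    """Full-batch subgradient descent on the L2-regularised hinge loss."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [lam * wi for wi in w], 0.0
        for xi, yi in zip(X, y):
            if yi * score((w, b), xi) < 1:  # margin violated
                for j in range(dim):
                    grad_w[j] -= yi * xi[j] / n
                grad_b -= yi / n
        w = [wi - lr * gi for wi, gi in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b


def train_mil(bags: Bags, labels: Dict[str, int], rounds: int = 5):
    """Alternate between fitting the SVM and picking one witness isoform per positive gene."""
    # Start by letting every isoform of a positive gene inherit the gene's label.
    witnesses = {g: sorted(isos) for g, isos in bags.items() if labels[g] == 1}
    model: Model = ([], 0.0)
    for _ in range(rounds):
        X, y = [], []
        for gene, isoforms in bags.items():
            if labels[gene] == 1:
                for iso in witnesses[gene]:
                    X.append(isoforms[iso])
                    y.append(1)
            else:
                # Every isoform of a negative gene is treated as negative.
                for vec in isoforms.values():
                    X.append(vec)
                    y.append(-1)
        model = train_linear_svm(X, y)
        # Keep only the highest-scoring isoform of each positive gene.
        updated = {}
        for gene in witnesses:
            best = max(bags[gene], key=lambda iso: score(model, bags[gene][iso]))
            updated[gene] = [best]
        if updated == witnesses:  # witness choice has stabilised
            break
        witnesses = updated
    return model, witnesses


# Hypothetical toy data: two expression features per isoform, one GO term.
bags = {
    "GENE_A": {"A1": [5.0, 0.1], "A2": [0.1, 0.1]},
    "GENE_B": {"B1": [4.5, 0.2]},
    "GENE_C": {"C1": [0.1, 0.2], "C2": [0.3, 4.0]},
    "GENE_D": {"D1": [0.2, 0.1]},
}
labels = {"GENE_A": 1, "GENE_B": 1, "GENE_C": -1, "GENE_D": -1}

model, witnesses = train_mil(bags, labels)
print(witnesses)
```

In this toy example the gene-level label on GENE_A ends up attributed to a single isoform, which is the kind of isoform-level assignment the MIL formulation is meant to recover from gene-level annotations. A real pipeline would train one such model per GO term and use far more features and genes.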
Availability – All predictions are provided through a web resource called “IsoFunc”, which is freely available to the global scientific community at http://guanlab.ccmb.med.umich.edu/isofunc.