bagSVM – Classification of RNA-Seq Data via Bagging Support Vector Machines

RNA sequencing (RNA-Seq) is a powerful technique for transcriptome profiling that uses the capabilities of next-generation sequencing (NGS) technologies. Recent advances in NGS make it possible to measure the expression levels of tens to thousands of transcripts simultaneously. Expression-based classification algorithms built on such information are an emerging and powerful method for diagnosis, disease classification and monitoring at the molecular level, and for identifying potential disease markers.

Here, a team led by researchers at Erciyes University, Turkey, has developed bagging support vector machines (bagSVM), a machine learning approach based on bagged ensembles of support vector machines (SVMs), for the classification of RNA-Seq data. bagSVM draws bootstrap samples from the training data, trains a separate SVM on each sample, and then combines the predictions of the individual SVM models by majority voting.
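The bootstrap, train and vote steps can be illustrated with a minimal Python sketch. It is not the authors' implementation or the MLSeq API: the function names are hypothetical, the base learner is a simple linear SVM fitted by subgradient descent on the hinge loss (a stand-in, since the kernel and solver used in the paper are not stated here), labels are assumed to be -1 or +1, and the ensemble size of 25 is an arbitrary choice.

```python
import random
from collections import Counter


def train_linear_svm(X, y, lam=0.01, eta=0.01, epochs=100, rng=None):
    """Fit a linear SVM by subgradient descent on the L2-regularized hinge loss.

    Stand-in base learner (assumption). Labels in y must be -1 or +1.
    """
    rng = rng or random.Random(0)
    w = [0.0] * len(X[0])
    b = 0.0
    order = list(range(len(X)))
    for _ in range(epochs):
        rng.shuffle(order)
        for i in order:
            score = sum(wj * xj for wj, xj in zip(w, X[i])) + b
            # Shrink the weights (regularization step).
            w = [wj * (1.0 - eta * lam) for wj in w]
            # Hinge-loss step only when the sample violates the margin.
            if y[i] * score < 1.0:
                w = [wj + eta * y[i] * xj for wj, xj in zip(w, X[i])]
                b += eta * y[i]
    return w, b


def predict_one(model, x):
    """Return +1 or -1 for a single sample under one linear model."""
    w, b = model
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b >= 0 else -1


def fit_bagged_svm(X, y, n_models=25, seed=0):
    """Train n_models SVMs, each on its own bootstrap sample of the rows."""
    rng = random.Random(seed)
    n = len(X)
    models = []
    for _ in range(n_models):
        # Bootstrap: draw n row indices with replacement.
        picks = [rng.randrange(n) for _ in range(n)]
        Xb = [X[i] for i in picks]
        yb = [y[i] for i in picks]
        models.append(train_linear_svm(Xb, yb, rng=rng))
    return models


def predict_bagged(models, x):
    """Combine the individual predictions by majority vote."""
    votes = Counter(predict_one(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

Here `fit_bagged_svm(X, y)` would take a list of feature vectors (for example, transformed expression values per sample) and a list of class labels, and `predict_bagged(models, x)` returns the voted class for a new sample. An odd number of models avoids tied votes in the two-class case.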


The researchers demonstrate the performance of bagSVM on simulated and real datasets. The simulated datasets were generated from the negative binomial distribution under different scenarios, and the real datasets were obtained from publicly available resources. DESeq normalization and a variance stabilizing transformation (vst) were applied to all datasets. The results were compared with several classifiers, including Poisson linear discriminant analysis (PLDA), single SVM, classification and regression trees (CART), and random forests (RF). In slightly overdispersed data, all methods except CART performed well. PLDA appeared to perform best, with RF second best, for very slightly and substantially overdispersed datasets. As the data became more dispersed, bagSVM turned out to be the best classifier. Overall, bagSVM and PLDA had the highest accuracies.
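For readers unfamiliar with the normalization step, the following Python sketch shows the median-of-ratios idea behind DESeq size factors. It is a simplified illustration and not the DESeq code itself: the function names are hypothetical, genes with a zero count in any sample are simply skipped, and the variance stabilizing transformation is not shown.

```python
import math
from statistics import median


def size_factors(counts):
    """Median-of-ratios size factors; counts[g][s] is gene g in sample s."""
    n_samples = len(counts[0])
    # Keep only genes with strictly positive counts in every sample,
    # so that the geometric mean is defined (simplifying assumption).
    usable = [row for row in counts if all(c > 0 for c in row)]
    if not usable:
        raise ValueError("no gene has positive counts in all samples")
    # Log of the per-gene geometric mean across samples.
    log_geo_mean = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for s in range(n_samples):
        # Log ratio of each gene's count to its geometric mean.
        log_ratios = [math.log(row[s]) - lg
                      for row, lg in zip(usable, log_geo_mean)]
        factors.append(math.exp(median(log_ratios)))
    return factors


def normalize(counts):
    """Divide every count by the size factor of its sample."""
    sf = size_factors(counts)
    return [[c / f for c, f in zip(row, sf)] for row in counts]
```

If one sample was sequenced at twice the depth of another, its size factor comes out twice as large, so the normalized counts of the two samples become comparable.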

According to these results, bagSVM applied after the vst can be a good choice of classifier for RNA-Seq datasets, particularly overdispersed ones. PLDA should be the method of choice for slightly and moderately overdispersed datasets.

Availability – An R/Bioconductor package, MLSeq, with a vignette is freely available at: http://www.bioconductor.org/packages/2.14/bioc/html/MLSeq.htm

Zararsiz G, Goksuluk D, Korkmaz S, Eldem V, Duru IP, Ozturk A, Unver T. (2014) Classification of RNA-Seq Data via Bagging Support Vector Machines. bioRxiv [Epub ahead of print].