Isoform-level expression data is more informative for biological classification with RNA-Seq

The extent to which genes are expressed in a cell can be described, simplistically, as a function of one or more environmental, lifestyle, and genetic factors. RNA sequencing (RNA-Seq) is becoming a prevalent approach for quantifying gene expression and is expected to provide better insight into a number of biological and biomedical questions than DNA microarrays. Most importantly, RNA-Seq allows expression to be quantified at both the gene level and the alternatively spliced isoform level. However, leveraging RNA-Seq data requires the development of new data mining and analytics methods. Supervised machine learning methods are commonly used for biological data analysis and have recently gained attention for their application to RNA-Seq data.

In this work, Worcester Polytechnic Institute researchers assess the utility of supervised learning methods trained on RNA-Seq data for a diverse range of biological classification tasks. They hypothesize that isoform-level expression data is more informative for biological classification tasks than gene-level expression data. Their large-scale assessment draws on multiple datasets, organisms, lab groups, and RNA-Seq analysis pipelines. Overall, the researchers set up and assessed 61 biological classification problems that leverage three independent RNA-Seq datasets and include over 2,000 samples. These 61 problems include predicting the tissue type, sex, or age of the sample, healthy or cancerous phenotypes, and the pathological tumor stage for samples from cancerous tissue. For each classification problem, the performance of three normalization techniques and six machine learning classifiers was explored. They find that for every single classification problem, the isoform-based classifiers outperform or are comparable to the gene-expression-based methods. The top-performing supervised learning techniques reached near-perfect classification accuracy, demonstrating the utility of supervised learning for RNA-Seq-based data analysis.
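To make the hypothesis concrete, here is a minimal toy sketch of why isoform-level values can carry information that gene-level values lose. It is not the researchers' method or data: the expression values are invented, the gene-level value is assumed to be the simple sum of its isoforms, and a nearest-centroid classifier with leave-one-out evaluation stands in for the classifiers used in the study.

```python
from statistics import mean

# Invented expression values (not from the study): one gene with two isoforms.
# Class "A" favours isoform 1 and class "B" favours isoform 2, while the
# gene-level total is 10 in every sample (an "isoform switch").
samples = [
    ((8.0, 2.0), "A"),
    ((7.0, 3.0), "A"),
    ((9.0, 1.0), "A"),
    ((2.0, 8.0), "B"),
    ((3.0, 7.0), "B"),
    ((1.0, 9.0), "B"),
]


def to_gene_level(isoforms):
    """Collapse isoform-level values into one gene-level value by summing (assumption)."""
    return (sum(isoforms),)


def centroid(vectors):
    """Per-feature mean of a list of equally sized vectors."""
    return tuple(mean(col) for col in zip(*vectors))


def sq_dist(u, v):
    """Squared Euclidean distance between two vectors."""
    return sum((a - b) ** 2 for a, b in zip(u, v))


def loo_accuracy(data):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    correct = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        classes = sorted({lab for _, lab in train})
        centroids = {
            c: centroid([v for v, lab in train if lab == c]) for c in classes
        }
        # Ties resolve to the alphabetically first class.
        predicted = min(classes, key=lambda c: sq_dist(x, centroids[c]))
        correct += predicted == label
    return correct / len(data)


isoform_acc = loo_accuracy(samples)
gene_acc = loo_accuracy([(to_gene_level(x), y) for x, y in samples])
print(f"isoform-level accuracy: {isoform_acc}")
print(f"gene-level accuracy:    {gene_acc}")
```

In this contrived case the two classes are perfectly separable at the isoform level, but every sample has the same gene-level value, so the gene-level classifier can do no better than guessing. Real data is far noisier, and the study reports that isoform-based classifiers were at least comparable, not that gene-level data is uninformative.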

Overall computational pipeline used in this work


The samples from each of the three datasets are collected. The classification tasks are then defined. The expression data are processed for each sample at the gene and isoform levels using two RNA processing pipelines and three different count measures. Next, feature pre-processing, scaling, and selection are done for each classification task. Finally, binary as well as multiclass supervised classifiers are trained and tested.
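The last two stages of the pipeline described above (feature pre-processing, scaling, and selection, followed by training and testing) can be sketched roughly as follows. This is an illustrative sketch only: the summary does not name the specific normalization techniques or classifiers, so the log2 transform, z-score scaling, class-mean-separation feature score, and nearest-centroid classifier are all stand-ins chosen for brevity, and the expression values and labels are invented.

```python
import math
from statistics import mean, pstdev


def log_transform(row):
    """Pre-processing stand-in: log2(x + 1) to dampen the range of expression values."""
    return [math.log2(x + 1.0) for x in row]


def fit_scaler(train_rows):
    """Per-feature mean and standard deviation, learned on training data only."""
    cols = list(zip(*train_rows))
    return [mean(c) for c in cols], [pstdev(c) or 1.0 for c in cols]


def scale(row, means, stds):
    """Z-score a single sample with previously fitted statistics."""
    return [(x - m) / s for x, m, s in zip(row, means, stds)]


def select_features(rows, labels, k):
    """Keep the k features whose class means are furthest apart (stand-in score)."""
    classes = sorted(set(labels))
    scores = []
    for col in zip(*rows):
        class_means = [
            mean(v for v, lab in zip(col, labels) if lab == c) for c in classes
        ]
        scores.append(max(class_means) - min(class_means))
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


def fit_centroids(rows, labels):
    """Train a nearest-centroid classifier: one mean vector per class."""
    centroids = {}
    for c in sorted(set(labels)):
        members = [r for r, lab in zip(rows, labels) if lab == c]
        centroids[c] = [mean(col) for col in zip(*members)]
    return centroids


def predict(row, centroids):
    """Assign the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def run_pipeline(train_x, train_y, test_x, k):
    """Pre-process, scale, select features, train, then predict the test samples."""
    train_log = [log_transform(r) for r in train_x]
    test_log = [log_transform(r) for r in test_x]
    means, stds = fit_scaler(train_log)
    train_scaled = [scale(r, means, stds) for r in train_log]
    test_scaled = [scale(r, means, stds) for r in test_log]
    keep = select_features(train_scaled, train_y, k)
    train_sel = [[r[j] for j in keep] for r in train_scaled]
    test_sel = [[r[j] for j in keep] for r in test_scaled]
    centroids = fit_centroids(train_sel, train_y)
    return keep, [predict(r, centroids) for r in test_sel]


# Invented counts for three features (genes or isoforms) and a binary task.
train_x = [[100, 5, 50], [120, 4, 52], [3, 6, 49], [2, 5, 51]]
train_y = ["tumor", "tumor", "normal", "normal"]
test_x = [[90, 5, 50], [4, 4, 50]]

kept, predictions = run_pipeline(train_x, train_y, test_x, k=1)
print(kept, predictions)
```

The scaler and the feature selection are fitted on the training samples only and then applied unchanged to the test samples, which avoids leaking test information into training. The same functions handle more than two classes, since the centroid step simply builds one mean vector per label.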
Johnson NT, Dhroso A, Hughes KJ, Korkin D. (2017) Biological classification with RNA-Seq data: Can alternative splicing enhance machine learning classifier? bioRxiv [Epub ahead of print]. [abstract]
