A central task of bioinformatics is to develop sensitive and specific means of providing medical prognoses from biomarker patterns. Common methods to predict phenotypes in RNA-Seq datasets utilize machine learning algorithms trained via gene expression. Isoforms, however, generated from alternative splicing, may provide a novel and complementary set of transcripts for phenotype prediction. In contrast to gene expression, the number of isoforms increases significantly due to numerous alternative splicing patterns, resulting in a prioritization problem for many machine learning algorithms. Researchers from Iowa State University identify the empirically optimal methods of transcript quantification, feature engineering and filtering steps using phenotype prediction accuracy as a metric. At the same time, the complementary nature of gene and isoform data is analyzed and the feasibility of identifying isoforms as biomarker candidates is examined.
Isoform features are complementary to gene features, providing non-redundant information and enhanced predictive power when prioritized and filtered. A univariate filtering algorithm, which selects up to the N highest ranking features for phenotype prediction is described and evaluated in this study. An empirical comparison of pipelines for isoform quantification is reported by performing cross-validation prediction tests with datasets from human non-small cell lung cancer (NSCLC) patients, human patients with chronic obstructive pulmonary disease (COPD) and amyotrophic lateral sclerosis (ALS) transgenic mice, each including samples of diseased and non-diseased phenotypes.
Phenotype prediction by pipelines using variable transcript quantification tools and filtering
AUC values were generated by running each dataset (NSCLC, ALS, COPD) through 54 pipelines that varied in transcript quantification tool, feature type (gene, isoform count, isoform fraction), use of filtering and machine learning algorithm (Random Forest, Elastic Net, SPLS). Predictive results are shown grouped by dataset, transcript quantification tool and use of filtering
Availability – https://github.com/clabuzze/Phenotype-Prediction-Pipeline.git