Isoforms of human miRNAs (isomiRs) are constitutively expressed in a tissue- and disease-subtype-dependent manner. Researchers from Thomas Jefferson University studied 10 271 tumor datasets from The Cancer Genome Atlas (TCGA) to evaluate whether isomiRs can distinguish among 32 TCGA cancers. Unlike previous approaches, they built a classifier that relied solely on ‘binarized’ isomiR profiles: each isomiR is simply labeled as ‘present’ or ‘absent’. The resulting classifier labeled tumor datasets with an average sensitivity of 90% and a false discovery rate (FDR) of 3%, surpassing the performance of expression-based classification. The classifier maintained its power even after a 15× reduction in the number of isomiRs used for training. Notably, the classifier could correctly predict the cancer type in non-TCGA datasets from diverse platforms. The researchers’ analysis revealed that the most discriminatory isomiRs also happen to be differentially expressed between normal tissue and cancer. Even so, they found that these highly discriminating isomiRs have not attracted the most research attention in the literature. Given their ability to classify datasets from 32 cancers, isomiRs and this ‘Pan-cancer Atlas’ of isomiR expression could serve as a suitable framework to explore novel cancer biomarkers.
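To make the idea of a ‘binarized’ profile concrete, here is a minimal Python sketch. The summary does not say how ‘present’ was defined, so the cutoff below (`threshold`), the function name `binarize_profile`, and the isomiR identifiers and values are all illustrative assumptions, not the researchers' actual procedure.

```python
def binarize_profile(expression, threshold=0.0):
    """Label each isomiR as 1 ('present') or 0 ('absent').

    `threshold` is a hypothetical cutoff chosen for illustration;
    the presence criterion used in the study is not given here.
    """
    return {name: int(value > threshold) for name, value in expression.items()}


# Made-up isomiR identifiers and normalized expression values.
sample = {"isomiR_a": 152.3, "isomiR_b": 0.0, "isomiR_c": 4.1}

binary = binarize_profile(sample, threshold=1.0)
print(binary)
```

A classifier trained on such vectors sees only which isomiRs are present in a sample, not how abundant they are.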
Support vector machines (SVMs) correctly classify 32 cancer types
(A and B) SVM classification using the binarized isomiR (A) or the miRNA arm (B) expression profile. Each row of the heatmap represents the original cancer class and each column the predicted cancer class. The color of each cell is proportional to the percentage (%) of samples belonging to the cancer type of the respective row that were predicted as the cancer type of the respective column. The % is calculated as the average across 1 000 iterations. (C and D) Sensitivity (C) and FDR (D) scores for the SVM models built using the binarized isomiR (magenta) or miRNA arm (yellow) expression profiles. The points at the bottom of the distribution represent the sensitivity (C) and FDR (D) scores from the 10-fold cross-validation analysis.
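The quantities in this figure can be derived from paired lists of original and predicted labels. The sketch below is one plausible way to do so and is not the researchers' code: the helper names are hypothetical and the labels are toy data. It takes sensitivity for a class as TP / (TP + FN) and FDR as FP / (TP + FP), and the row percentages match the heatmap description (share of each original class assigned to each predicted class).

```python
from collections import Counter


def confusion_counts(true_labels, predicted_labels):
    """Count (original, predicted) label pairs."""
    return Counter(zip(true_labels, predicted_labels))


def row_percentages(counts, classes):
    """Percentage of each original class (row) predicted as each class (column)."""
    table = {}
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)
        table[t] = {
            p: (100.0 * counts[(t, p)] / row_total if row_total else 0.0)
            for p in classes
        }
    return table


def sensitivity_and_fdr(counts, classes):
    """Per-class sensitivity TP/(TP+FN) and false discovery rate FP/(TP+FP)."""
    scores = {}
    for c in classes:
        tp = counts[(c, c)]
        fn = sum(counts[(c, p)] for p in classes if p != c)
        fp = sum(counts[(t, c)] for t in classes if t != c)
        sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
        fdr = fp / (tp + fp) if (tp + fp) else 0.0
        scores[c] = (sensitivity, fdr)
    return scores


# Toy labels for two cancer types; not real TCGA results.
classes = ["BRCA", "LUAD"]
true_labels = ["BRCA"] * 4 + ["LUAD"] * 4
predicted = ["BRCA", "BRCA", "BRCA", "LUAD", "LUAD", "LUAD", "LUAD", "BRCA"]

counts = confusion_counts(true_labels, predicted)
percentages = row_percentages(counts, classes)
scores = sensitivity_and_fdr(counts, classes)
print(percentages)
print(scores)
```

In the study, these values were averaged across 1 000 iterations; the sketch covers a single train/test split only.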