Carcinoma of unknown primary (CUP) is a type of metastatic cancer whose tissue of origin (TOO) cannot be identified by traditional methods. CUP patients typically have a poor prognosis, but therapy that targets the original cancer tissue can significantly improve outcomes. It is therefore critical to develop accurate computational methods to infer the TOO. While qPCR- and microarray-based methods are effective in inferring the TOO for most cancer types, their overall prediction accuracy leaves room for improvement. In this study, researchers from Changsha Medical University propose a cross-cohort computational framework that traces the TOO of 32 cancer types from RNA sequencing (RNA-seq) data. Specifically, the researchers used logistic regression models to select 80 genes for each cancer type, yielding a combined 1356-gene set, based on transcriptomic data from 9911 tissue samples covering the 32 cancer types with known TOO from The Cancer Genome Atlas (TCGA). The selected genes are enriched in both tissue-specific and tissue-general functions. The cross-validation accuracy of the framework reaches 97.50% across all cancer types. The researchers also tested the model on the TCGA metastatic dataset and the International Cancer Genome Consortium (ICGC) dataset, achieving accuracies of 91.09% and 82.67%, respectively, despite differences in experimental procedures and pipelines.
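The per-type gene selection described above can be illustrated with a minimal sketch. The summary does not say how the logistic regression models rank genes, so this sketch assumes one-vs-rest models and ranks genes by absolute coefficient; the function name `select_gene_panel`, the synthetic data, and all parameter values are hypothetical and are not taken from the authors' code.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def select_gene_panel(X, y, genes_per_type):
    """Union of the top-weighted genes from one-vs-rest logistic models.

    X: (n_samples, n_genes) expression matrix; y: cancer-type labels.
    Returns the sorted column indices of the combined panel.
    Assumption: genes are ranked by absolute coefficient.
    """
    # Scale genes so that coefficient magnitudes are comparable.
    X_scaled = StandardScaler().fit_transform(X)
    panel = set()
    for cancer_type in np.unique(y):
        is_type = (y == cancer_type).astype(int)  # this type vs. all others
        model = LogisticRegression(max_iter=1000).fit(X_scaled, is_type)
        weights = np.abs(model.coef_[0])
        top = np.argsort(weights)[::-1][:genes_per_type]
        panel.update(int(i) for i in top)  # shared genes are counted once
    return sorted(panel)


# Synthetic stand-in for an expression matrix: 3 types, 30 genes,
# with two marker genes shifted upward in each type.
rng = np.random.default_rng(0)
n_types, n_per_type, n_genes = 3, 40, 30
y = np.repeat(np.arange(n_types), n_per_type)
X = rng.normal(size=(n_types * n_per_type, n_genes))
for t in range(n_types):
    X[y == t, 2 * t:2 * t + 2] += 4.0

panel = select_gene_panel(X, y, genes_per_type=2)

# Train a multi-class classifier on the combined panel only.
classifier = LogisticRegression(max_iter=1000).fit(X[:, panel], y)
print(len(panel), "genes selected:", panel)
```

Because a gene can rank highly for more than one cancer type, the union can be smaller than the number of types times the genes per type. That is consistent with the reported 1356-gene set being smaller than 32 × 80 = 2560. In a real evaluation the selection step would need to sit inside the cross-validation loop, so that held-out samples do not influence which genes are chosen.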
Datasets and flowchart of this work. The TCGA primary dataset was used to evaluate different combinations of preprocessing, feature selection, and classification methods. The best combination was then trained on the TCGA primary dataset, and the resulting model was tested on independent datasets.
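The workflow in this caption (compare combinations by cross-validation, then refit the best one and test it on separate data) can be sketched as follows. The summary does not list which preprocessing, feature selection, or classification methods were compared, so every candidate below is a hypothetical placeholder, the data are synthetic, and a held-out split stands in for a truly independent cohort.

```python
from functools import partial
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectKBest, f_classif, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, StandardScaler

# Synthetic count-like data: 3 types, 40 genes, two marker genes per type.
rng = np.random.default_rng(0)
n_types, n_per_type, n_genes = 3, 60, 40
y = np.repeat(np.arange(n_types), n_per_type)
X = rng.poisson(lam=20, size=(y.size, n_genes)).astype(float)
for t in range(n_types):
    X[y == t, 2 * t:2 * t + 2] += 40

# Stand-in for "primary" (training) and "independent" (test) data.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Hypothetical candidates for each stage of the pipeline.
preprocessors = {
    "log1p": FunctionTransformer(np.log1p),
    "zscore": StandardScaler(),
}
selectors = {
    "anova": SelectKBest(f_classif, k=6),
    "mutual_info": SelectKBest(partial(mutual_info_classif, random_state=0), k=6),
}
classifiers = {
    "logreg": LogisticRegression(max_iter=1000),
    "forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Score every combination by 5-fold cross-validation on the training data.
# Selection runs inside the pipeline, so each fold selects genes on its own
# training part only.
pipelines, scores = {}, {}
for (p_name, prep), (s_name, sel), (c_name, clf) in product(
        preprocessors.items(), selectors.items(), classifiers.items()):
    combo = (p_name, s_name, c_name)
    pipelines[combo] = make_pipeline(prep, sel, clf)
    scores[combo] = cross_val_score(
        pipelines[combo], X_train, y_train, cv=5).mean()

# Refit the best combination on all training data, then test it once.
best_combo = max(scores, key=scores.get)
best_pipeline = pipelines[best_combo].fit(X_train, y_train)
held_out_accuracy = best_pipeline.score(X_test, y_test)
print(best_combo, round(held_out_accuracy, 3))
```

Wrapping all three stages in one pipeline keeps the comparison fair: the combination is chosen using training data only, and the test data are touched a single time at the end.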
Availability – The code is available at http://github.com/wangbo00129/classifybysklearn