Detection of somatic point mutations from patient sequencing data has many clinical applications, including the identification of cancer driver genes, detection of mutational signatures, and estimation of tumor mutational burden (TMB). In recent work, researchers from the University of Maryland and the Israel Institute of Technology developed a tool for detecting somatic mutations using tumor RNA and matched-normal DNA. Here, they extend it to detect somatic mutations from RNA sequencing data without a matched-normal sample. This is accomplished with a machine learning approach that classifies each mutation as either somatic or germline based on various features. When applied to RNA sequencing data from more than 450 melanoma samples, the method achieves high precision and recall and correctly identifies both mutational signatures and driver genes. Finally, the researchers show that RNA-based TMB is significantly associated with patient survival, with performance similar or superior to that of DNA-based TMB. The pipeline could be used in many future applications to analyze novel and existing datasets where only RNA is available.
(a) An overview of the RNA-MuTect-WMN pipeline: In the training set (n=100, green arrows), RNA-MuTect is applied to tumor RNA and matched-normal DNA to obtain a list of variants labeled as somatic or germline. A random forest classifier is then trained on the features collected for each variant, using 5-fold cross-validation. In the test set (n=362, orange arrows), (1) MuTect is applied to tumor RNA without a matched-normal sample, yielding a mixed list of somatic and germline variants. (2) The five trained models are then applied to this set of variants and classify each one as either somatic or germline by majority vote. (3) Finally, the predicted set of variants is further filtered by the RNA-MuTect filtering steps. (b) Precision and recall on the validation (left) and test (right) sets, computed for each sample. Box plots show the median and the 25th and 75th percentiles. The whiskers extend to the most extreme data points not considered outliers, and outliers are represented as dots. (c) Precision as a function of the number of true somatic mutations per sample. (d) Correlation between the number of predicted somatic mutations and the number of somatic mutations determined from tumor DNA with a matched-normal DNA sample. (e) Correlation between the number of predicted somatic mutations and the number of somatic mutations determined from tumor RNA with a matched-normal DNA sample.
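The majority-vote step in panel (a) and the per-sample precision and recall in panel (b) can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`majority_vote`, `classify_variants`, `precision_recall`) and the toy labels are hypothetical, and the random forest training itself is omitted. The sketch assumes each of the five models has already produced one label per variant.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by most models for a single variant."""
    return Counter(votes).most_common(1)[0][0]

def classify_variants(per_model_predictions):
    """Combine per-model label lists (one list per model, aligned by
    variant) into one consensus label per variant."""
    return [majority_vote(votes) for votes in zip(*per_model_predictions)]

def precision_recall(predicted, truth, positive="somatic"):
    """Compute precision and recall for the positive class in one sample."""
    pairs = list(zip(predicted, truth))
    tp = sum(p == positive and t == positive for p, t in pairs)
    fp = sum(p == positive and t != positive for p, t in pairs)
    fn = sum(p != positive and t == positive for p, t in pairs)
    precision = tp / (tp + fp) if tp + fp else float("nan")
    recall = tp / (tp + fn) if tp + fn else float("nan")
    return precision, recall

# Toy data (invented for illustration): five models, four variants.
S, G = "somatic", "germline"
per_model = [
    [S, G, S, G],
    [S, G, G, G],
    [S, S, S, G],
    [G, G, S, G],
    [S, G, G, S],
]
consensus = classify_variants(per_model)

# Hypothetical reference labels, standing in for the calls made with a
# matched-normal DNA sample.
truth = [S, G, G, S]
precision, recall = precision_recall(consensus, truth)
```

With five models and two classes, a vote can never tie, which is one practical reason an odd number of folds suits this kind of ensemble.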