Large, publicly available gene expression datasets are often analyzed with the aid of machine learning algorithms. Although RNA-seq is increasingly the technology of choice, a wealth of expression data already exists in the form of microarray data. If machine learning models built from legacy data can be applied to RNA-seq data, larger, more diverse training datasets can be created and validation can be performed on newly generated data.
Researchers from the Geisel School of Medicine at Dartmouth developed Training Distribution Matching (TDM), which transforms RNA-seq data for use with models trained on data from legacy platforms.
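The core idea behind TDM is to rescale RNA-seq values so that their distribution resembles that of the microarray training data, using the reference data's ratio of total range to interquartile range to decide how far the RNA-seq values may spread before being treated as outliers. The sketch below is a simplified illustration of that idea only, not the published TDM implementation; the function name `tdm_like_transform` and the exact clipping rule are our own simplifications.

```python
import numpy as np

def tdm_like_transform(rnaseq, reference):
    """Rescale RNA-seq values so their spread-to-range ratio roughly
    matches a microarray reference. Simplified sketch, not the TDM package."""
    # How wide is the reference's total range relative to its IQR?
    ref_q1, ref_q3 = np.percentile(reference, [25, 75])
    ref_span = reference.max() - reference.min()
    spread_ratio = ref_span / (ref_q3 - ref_q1)

    # Allow the RNA-seq data the same multiple of its own IQR;
    # values outside that window are clipped as outliers.
    q1, q3 = np.percentile(rnaseq, [25, 75])
    iqr = q3 - q1
    half_extra = (spread_ratio * iqr - iqr) / 2.0
    clipped = np.clip(rnaseq, max(q1 - half_extra, 0.0), q3 + half_extra)

    # Linearly map the clipped values onto the reference's range.
    scaled = (clipped - clipped.min()) / (clipped.max() - clipped.min())
    return scaled * ref_span + reference.min()
```

After this transformation the RNA-seq values span the same range as the reference data, so a classifier trained on the reference distribution sees inputs on a familiar scale.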
They evaluated TDM, as well as quantile normalization, nonparanormal transformation, and a simple log2 transformation, on both simulated and biological datasets of gene expression. Their evaluation included both supervised and unsupervised machine learning approaches. They found that TDM exhibited consistently strong performance across settings and that quantile normalization also performed well in many circumstances.
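Quantile normalization, one of the compared methods, forces the target data to follow the reference distribution exactly: each value is replaced by the reference value at the same rank. A minimal sketch of this standard technique (not the authors' code; the function name is ours):

```python
import numpy as np

def quantile_normalize_to_reference(target, reference):
    """Replace each target value with the reference value at the
    corresponding rank, so the distributions match exactly."""
    ranks = np.argsort(np.argsort(target))  # rank of each target value
    ref_sorted = np.sort(reference)
    # Map target ranks onto reference positions of the same relative rank.
    positions = ranks * (len(reference) - 1) / max(len(target) - 1, 1)
    return ref_sorted[np.round(positions).astype(int)]
```

For example, with `target = [5, 1, 3]` and `reference = [10, 20, 30, 40]`, the smallest target value maps to the smallest reference value, giving `[40, 10, 30]`.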
Results for Dataset 1. (A) Mean total accuracy for BRCA subtype classification across ten iterations with 95% confidence intervals. Dashed line represents the “no information rate” that could be achieved by always picking the most common class. NPN had the highest mean total accuracy on these data, followed by TDM, quantile normalization, and log2 transformation. The untransformed RNA-seq data performed the worst. (B) Mean Kappa for BRCA subtype classification across ten iterations. NPN had the highest mean Kappa on these data, followed by TDM, quantile normalization, and log2 transformation. The untransformed RNA-seq data performed the worst.
Availability – A TDM package for the R programming language is available at: https://github.com/greenelab/TDM