The field of transcriptomics measures mRNA as a proxy for gene expression. Two major platforms are currently in use for quantifying mRNA: microarray and RNA-Seq. Many comparative studies have shown that their results are not always consistent. In this study, University of Amsterdam researchers aim to find a robust method to increase the comparability of the two platforms, enabling analysis of merged data from both. They transformed high-dimensional transcriptomics data from the two platforms into a lower-dimensional, biologically relevant dataset by calculating enrichment scores, based on gene set collections, for all samples. The researchers then compared the similarity between data from the two platforms, first on the raw data and then on the enrichment scores. They show that the transformation reshapes the data in a biologically relevant way and filters out noise, which leads to increased platform concordance. They validate the procedure by using predictive models built on microarray-based enrichment scores to predict breast cancer subtypes from enrichment scores based on sequencing data. Although microarray and RNA-Seq expression levels might appear different, transforming them into biologically relevant gene set enrichment scores significantly increases their correlation, which is a step forward in integrating data from the two platforms. The gene set collections were shown to contain biologically relevant gene sets. The researchers suggest that future work investigate in more depth how the composition, size, and number of gene sets used for the transformation affect the result.
Demonstration of the enrichment score calculation with a simulated dataset
Top: sorted expression levels of 250 genes, in which coloring represents gene set membership. Middle left: the calculated PH values for each gene set. Middle right: the PNH values for each gene set. Bottom: the resulting enrichment scores for each of the three gene sets.
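
The calculation illustrated in the figure can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the researchers' implementation: it assumes that PH and PNH are the cumulative fractions of genes inside and outside a gene set, accumulated while walking down the list of genes sorted by expression, and that the enrichment score is the largest deviation of PH minus PNH (an unweighted, Kolmogorov-Smirnov-like running sum). The function name, the three gene set names, and the simulated values are all hypothetical.

```python
import numpy as np

def enrichment_score(expression, in_set):
    """Unweighted running-sum enrichment score for one sample and one gene set.

    Assumption: PH / PNH are cumulative fractions of genes in / not in the set,
    taken over genes sorted from highest to lowest expression.
    """
    order = np.argsort(-expression)              # rank genes, highest expression first
    hits = in_set[order].astype(float)           # 1.0 where the ranked gene is in the set
    misses = 1.0 - hits                          # 1.0 where it is not
    p_hit = np.cumsum(hits) / hits.sum()         # assumed meaning of PH
    p_miss = np.cumsum(misses) / misses.sum()    # assumed meaning of PNH
    running = p_hit - p_miss
    return running[np.argmax(np.abs(running))]   # signed maximum deviation

# Simulated sample: 250 genes with random expression levels (hypothetical data).
rng = np.random.default_rng(0)
n_genes = 250
expression = rng.normal(size=n_genes)
ranked = np.argsort(-expression)

# Three hypothetical gene sets, stored as boolean membership masks.
gene_sets = {name: np.zeros(n_genes, dtype=bool) for name in ("high", "low", "spread")}
gene_sets["high"][ranked[:20]] = True      # the 20 most highly expressed genes
gene_sets["low"][ranked[-20:]] = True      # the 20 least expressed genes
gene_sets["spread"][ranked[::10]] = True   # every 10th gene along the ranking

scores = {name: enrichment_score(expression, mask) for name, mask in gene_sets.items()}
for name, score in scores.items():
    print(f"{name}: {score:+.2f}")
```

Under these assumptions, a gene set concentrated at the top of the ranking scores close to +1, one concentrated at the bottom scores close to -1, and one spread evenly along the ranking scores close to 0. Repeating the calculation for every sample and every gene set in a collection would give the lower-dimensional samples-by-gene-sets matrix on which the two platforms are compared.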