Gene expression levels are dynamic molecular phenotypes that respond to biological, environmental, and technical perturbations. Here, University of Washington researchers use a novel replicate classifier approach to discover transcriptional signatures and apply it to the Genotype-Tissue Expression (GTEx) data set. They identify many factors contributing to expression heterogeneity, such as collection center and ischemia time, and their approach of scoring replicate classifiers allows these factors to be statistically stratified by effect strength. Strikingly, from transcriptional expression in blood alone, they detect markers that help predict heart disease and stroke in some patients. These results illustrate the challenges and opportunities of interpreting patterns of transcriptional variation in large-scale data sets.
Classification accuracy for tissue types and confounding factors in the GTEx project
a) 100 “technical replicate” Random Forest (RF) classifiers were generated for each tissue type, and median receiver operating characteristic (ROC) area under the curve (AUC) scores were calculated. Scores range from 1.0 (perfect classification) to 0.5 (random guessing). Median AUC scores are shown as blue dots with bootstrapped 95% confidence intervals of the median. For each classifier, the labels were also permuted and the ROC-AUCs recalculated (red dots) to provide an unbiased null score for each tissue type. Relatively low ROC-AUCs exposed digestion-related gene expression heterogeneity in stomach samples (b). RF classifier accuracy also indicates confounding factors in blood samples, including cause of death, (c) when normalizing only for run bias or (d) after DESeq/PEER normalization with sex as a cofactor.
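The scoring scheme described in the caption (per-replicate ROC-AUC, median with a bootstrapped confidence interval, and a permuted-label null) can be illustrated with a minimal sketch. This is not the authors' pipeline: no Random Forest is trained here, the classifier scores are synthetic Gaussian draws, and the names `roc_auc`, `bootstrap_median_ci`, and `simulate_replicates`, as well as the sample sizes and the `shift` parameter, are assumptions made purely for illustration.

```python
import random
import statistics


def roc_auc(labels, scores):
    """ROC-AUC via the rank-sum (Mann-Whitney U) identity; ties get average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    pos_rank_sum = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (pos_rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def bootstrap_median_ci(values, rng, n_boot=1000, level=0.95):
    """Percentile bootstrap confidence interval for the median of `values`."""
    medians = sorted(
        statistics.median(rng.choices(values, k=len(values)))
        for _ in range(n_boot)
    )
    lo = medians[int((1 - level) / 2 * n_boot)]
    hi = medians[int((1 + level) / 2 * n_boot) - 1]
    return lo, hi


def simulate_replicates(rng, n_replicates=100, n_per_class=30, shift=1.5):
    """Stand-in for replicate classifiers: synthetic scores, NOT real RF output."""
    labels = [1] * n_per_class + [0] * n_per_class
    observed, null = [], []
    for _ in range(n_replicates):
        # Hypothetical classifier scores: positives are shifted upward.
        scores = [rng.gauss(shift * y, 1.0) for y in labels]
        observed.append(roc_auc(labels, scores))
        # Null score: same scores evaluated against permuted labels.
        permuted = labels[:]
        rng.shuffle(permuted)
        null.append(roc_auc(permuted, scores))
    return observed, null


rng = random.Random(0)
observed, null = simulate_replicates(rng)
obs_median = statistics.median(observed)
null_median = statistics.median(null)
obs_ci = bootstrap_median_ci(observed, rng)
null_ci = bootstrap_median_ci(null, rng)
print(f"observed median AUC: {obs_median:.3f}, 95% CI {obs_ci}")
print(f"permuted median AUC: {null_median:.3f}, 95% CI {null_ci}")
```

In a real analysis, the synthetic scores would be replaced by held-out class probabilities from each replicate classifier. A factor whose observed interval sits well above the permuted-label interval would then be read as detectable in the expression data.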