Publicly available genomic data are a valuable resource for studying normal human variation and disease, but these data are often not well labeled or annotated. The lack of phenotype information for public genomic data severely limits their utility for addressing targeted biological questions. Researchers from the Johns Hopkins Bloomberg School of Public Health have developed an in silico phenotyping approach for predicting critical missing annotation directly from genomic measurements using well-annotated genomic and phenotypic data produced by consortia like TCGA and GTEx as training data. They apply in silico phenotyping to a set of 70 000 RNA-seq samples they recently processed on a common pipeline as part of the recount2 project. The researchers use gene expression data to build and evaluate predictors for both biological phenotypes (sex, tissue, sample source) and experimental conditions (sequencing strategy). They demonstrate how these predictions can be used to study cross-sample properties of public genomic data, select genomic projects with specific characteristics, and perform downstream analyses using predicted phenotypes.
General approach to phenotype prediction
To predict phenotype information, the training data are first randomly divided into two halves, and the predictor is built on one half. Accuracy is first assessed on that same half. Once the predictor reaches sufficient accuracy (≥85%), it is tested on the remaining, held-out half of the training data set. Phenotypes can then be predicted across all samples in recount2.
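The split, build, check, and test sequence described above can be sketched in a few lines of Python. This is an illustration only, not the phenopredict implementation or its API: the nearest-centroid classifier, the synthetic two-gene data, and every function name here (`split_in_half`, `fit_centroids`, `build_and_evaluate`, and so on) are assumptions made for the example. Only the random halving and the 85% accuracy threshold come from the text.

```python
import random

# From the text: move on to the held-out half only at >= 85% training accuracy.
ACCURACY_THRESHOLD = 0.85


def split_in_half(samples, rng):
    """Randomly divide the training data into a build half and a held-out half."""
    shuffled = samples[:]
    rng.shuffle(shuffled)
    mid = len(shuffled) // 2
    return shuffled[:mid], shuffled[mid:]


def fit_centroids(samples):
    """Toy predictor (an assumption): mean expression vector per phenotype label."""
    sums, counts = {}, {}
    for expr, label in samples:
        acc = sums.setdefault(label, [0.0] * len(expr))
        for i, value in enumerate(expr):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def predict(centroids, expr):
    """Assign the label whose centroid is closest in squared Euclidean distance."""
    def sq_dist(center):
        return sum((a - b) ** 2 for a, b in zip(expr, center))
    return min(centroids, key=lambda label: sq_dist(centroids[label]))


def accuracy(centroids, samples):
    """Fraction of samples whose predicted label matches the annotated label."""
    correct = sum(predict(centroids, expr) == label for expr, label in samples)
    return correct / len(samples)


def make_synthetic(n, rng):
    """Hypothetical stand-in for annotated training data: two genes, labeled by sex."""
    data = []
    for _ in range(n):
        label = rng.choice(["female", "male"])
        shift = 3.0 if label == "male" else 0.0
        data.append(([rng.gauss(shift, 1.0), rng.gauss(-shift, 1.0)], label))
    return data


def build_and_evaluate(samples, seed=0):
    """Build on one half, check training accuracy, then test on the other half."""
    rng = random.Random(seed)
    build_half, test_half = split_in_half(samples, rng)
    centroids = fit_centroids(build_half)
    train_acc = accuracy(centroids, build_half)
    if train_acc < ACCURACY_THRESHOLD:
        # Not accurate enough on the build half: do not touch the held-out half.
        return {"train_accuracy": train_acc, "test_accuracy": None, "predictor": None}
    return {
        "train_accuracy": train_acc,
        "test_accuracy": accuracy(centroids, test_half),
        "predictor": centroids,
    }


result = build_and_evaluate(make_synthetic(400, random.Random(42)))
print(result["train_accuracy"], result["test_accuracy"])
```

A predictor that passes both checks would then be applied to unannotated samples by calling `predict` on each expression vector. Keeping the held-out half untouched until the threshold is met is what makes the second accuracy figure an honest estimate.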
Availability – Code for the R package phenopredict is available at https://github.com/leekgroup/phenopredict.