ConSReg – prediction of condition-specific regulatory genes using machine learning

Recent advances in genomic technologies have generated large-scale data on protein–DNA interactions and open chromatin regions for many eukaryotic species. Using these data to identify the condition-specific functions of transcription factors has become a major challenge in genomic research. To address this problem, Virginia Tech researchers have developed ConSReg, a method that integrates regulatory genomic data into predictive machine learning models of key regulatory genes. Using Arabidopsis as a model system, the researchers tested the approach on data sets from single-cell gene expression and from abiotic stress treatments. ConSReg accurately predicted transcription factors that regulate differentially expressed genes, with an average area under the ROC curve (auROC) of 0.84, which is 23.5–25% better than enrichment-based approaches. To further validate its performance, they analyzed an independent data set related to plant nitrogen responses. ConSReg ranked the correct transcription factors better in 61.7% of cases, which is three times better than other plant tools. Applied to Arabidopsis single-cell RNA-seq data, ConSReg identified candidate regulatory genes that control cell wall formation. These methods provide a new approach to defining candidate regulatory genes in plants using integrated genomic data.


Flowchart of the ConSReg pipeline. (A) Analysis workflow. (B) Genomic data integration strategy. DAP-seq and ATAC-seq regions were intersected, a weight was computed for each intersected region, and the weights were summed to give the final weight for each TF–gene pair. The product of the TF fold change and the final weight fills the corresponding entry of the feature matrix (see Materials and Methods for more details). Parameters a, b, c, d, e, f, and g are the lengths of the corresponding regions. (C) Cross-validation strategy. Final AUC–ROC values were computed from the 20% test data. We repeated this analysis five times for each integrated data set and calculated the average and standard deviation of the AUC–ROC values.
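The integration step in panel (B) can be illustrated with a small sketch. This is not the ConSReg implementation: the function names, the input layout, and above all the weighting rule are assumptions made for illustration. The paper defines the actual per-region weight in its Materials and Methods; here the weight of an intersected region is simply taken to be its length in base pairs.

```python
from collections import defaultdict


def overlap(a, b):
    """Return the overlap of two half-open (start, end) intervals, or None."""
    start, end = max(a[0], b[0]), min(a[1], b[1])
    return (start, end) if start < end else None


def tf_gene_weights(dap_peaks, atac_peaks):
    """Sum the per-region weights for each TF-gene pair.

    dap_peaks:  list of (tf, gene, start, end) binding sites assigned to a gene.
    atac_peaks: dict mapping gene -> list of (start, end) open chromatin regions.
    ASSUMPTION: the weight of an intersected region is its length in bp.
    The published method uses its own weighting, which is not reproduced here.
    """
    weights = defaultdict(float)
    for tf, gene, start, end in dap_peaks:
        for region in atac_peaks.get(gene, []):
            hit = overlap((start, end), region)
            if hit:
                weights[(tf, gene)] += hit[1] - hit[0]
    return dict(weights)


def feature_matrix(weights, tf_fold_change, genes, tfs):
    """Rows are genes, columns are TFs; entry = TF fold change * summed weight."""
    return [
        [tf_fold_change.get(tf, 0.0) * weights.get((tf, gene), 0.0) for tf in tfs]
        for gene in genes
    ]


# Hypothetical toy inputs, invented for this example.
dap = [("TF1", "G1", 100, 200), ("TF1", "G1", 300, 350), ("TF2", "G2", 50, 80)]
atac = {"G1": [(150, 320)], "G2": [(0, 40)]}
fold_change = {"TF1": 2.0, "TF2": 1.5}

w = tf_gene_weights(dap, atac)
matrix = feature_matrix(w, fold_change, ["G1", "G2"], ["TF1", "TF2"])
print(matrix)
```

In this toy case only TF1 has binding sites that fall in open chromatin near G1, so only that entry of the matrix is nonzero.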

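The evaluation scheme in panel (C), a held-out 20% test set scored by AUC–ROC and repeated five times, can be sketched as follows. Several details are assumptions rather than the authors' procedure: the split is stratified by class, the remaining 80% is used directly for training (any inner cross-validation is omitted), and the centroid-based scorer is only a stand-in for the actual machine learning model. All function names are hypothetical.

```python
import random
import statistics


def roc_auc(labels, scores):
    """AUC-ROC via the rank-sum (Mann-Whitney U) identity; ties share average ranks."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    ranks = [0.0] * len(scores)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and scores[order[j + 1]] == scores[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        i = j + 1
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    rank_sum_pos = sum(r for r, y in zip(ranks, labels) if y == 1)
    return (rank_sum_pos - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


def stratified_split(y, test_fraction, rng):
    """ASSUMPTION: hold out the same fraction of each class (at least one sample)."""
    test = []
    for label in (0, 1):
        idx = [i for i, v in enumerate(y) if v == label]
        rng.shuffle(idx)
        test += idx[: max(1, round(len(idx) * test_fraction))]
    held_out = set(test)
    train = [i for i in range(len(y)) if i not in held_out]
    return train, test


def fit_centroid(X, y):
    """Stand-in model: direction from the negative-class mean to the positive-class mean."""
    def mean(rows):
        return [sum(col) / len(rows) for col in zip(*rows)]
    pos = mean([x for x, v in zip(X, y) if v == 1])
    neg = mean([x for x, v in zip(X, y) if v == 0])
    return [p - q for p, q in zip(pos, neg)]


def predict_centroid(direction, X):
    """Score each sample by its projection onto the fitted direction."""
    return [sum(w * v for w, v in zip(direction, x)) for x in X]


def repeated_holdout_auc(X, y, n_repeats=5, test_fraction=0.2, seed=0):
    """Mean and standard deviation of test-set AUC-ROC over repeated random splits."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_repeats):
        train, test = stratified_split(y, test_fraction, rng)
        model = fit_centroid([X[i] for i in train], [y[i] for i in train])
        scores = predict_centroid(model, [X[i] for i in test])
        aucs.append(roc_auc([y[i] for i in test], scores))
    return statistics.mean(aucs), statistics.stdev(aucs)


# Synthetic data invented for this example: 50 positives, 50 negatives, one feature.
data_rng = random.Random(1)
X = [[data_rng.gauss(1.0, 0.3)] for _ in range(50)] + [[data_rng.gauss(0.0, 0.3)] for _ in range(50)]
y = [1] * 50 + [0] * 50
mean_auc, sd_auc = repeated_holdout_auc(X, y)
print(f"AUC-ROC: {mean_auc:.3f} +/- {sd_auc:.3f}")
```

Reporting the mean together with the standard deviation across repeats shows how sensitive the score is to the particular 80/20 split.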

Availability – ConSReg is implemented as an open-source Python package and is freely available on GitHub: https://github.com/LiLabAtVT/ConSReg

Song Q, Lee J, Akter S, Rogers M, Grene R, Li S. (2020) Prediction of condition-specific regulatory genes using machine learning. Nucleic Acids Research [Epub ahead of print].
