A probabilistic approach for automated discovery of perturbed genes using expression data from microarray or RNA-Seq

In complex diseases, alterations of multiple molecular and cellular components in response to perturbations are indicative of disease physiology. While the expression levels of genes measured by high-throughput analysis can vary among patients, the common course of disease progression suggests that the underlying cellular sub-processes involving the associated genes follow similar fates. Motivated by the interconnected nature of these sub-processes, researchers at the University of Cincinnati have developed an automated methodology that combines ideas from biological networks, statistical models, and game theory to probe connected cellular processes. The core concept of their approach is the probability of change (POC): the probability that a gene’s expression level has changed between two conditions. POC makes it possible to define change at the neighborhood, pathway, and network levels, and to evaluate the influence of disease on expression. The ‘connected’ disease-related genes (DRG) identified in this way display coherent and concomitant differential expression along paths.
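To make the idea of a per-gene probability of change concrete, here is a minimal sketch. It is not the estimator used in the paper, whose statistical model is not described here. As a stand-in, it assumes POC can be approximated by the fraction of cross-condition sample pairs whose expression differs by more than a threshold. The function name `probability_of_change`, the `delta` parameter, and the expression values are all hypothetical.

```python
from itertools import product


def probability_of_change(cond_a, cond_b, delta=1.0):
    """Fraction of cross-condition sample pairs whose expression
    differs by more than `delta` (an assumed stand-in for POC)."""
    if not cond_a or not cond_b:
        raise ValueError("both conditions need at least one sample")
    pairs = list(product(cond_a, cond_b))
    changed = sum(1 for a, b in pairs if abs(b - a) > delta)
    return changed / len(pairs)


# Toy log2 expression values for one hypothetical gene (illustrative only).
normal = [5.0, 5.2, 4.9, 5.1]
tumor = [7.1, 6.8, 5.3, 7.4]

poc = probability_of_change(normal, tumor, delta=1.0)
print(poc)
```

Unlike a fold-change cutoff combined with a p-value, a quantity of this kind lies between 0 and 1 for every gene, which is what allows it to be attached to the nodes and edges of a network.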


In the standard approach (top portion of figure), genes are analyzed for fold-change and p-value (statistical analysis) in order to identify differentially expressed genes. In some approaches, these feature genes are used to seed a network that brings in additional genes before the list is submitted for further analysis. In other cases, the list is used directly for further analysis, which may include discriminative machine learning and functional exploration using pathway databases such as DAVID. The bottom portion of the figure illustrates the POC approach. It begins by building a putative regulatory network from a combination of databases. This network is overlaid with node and edge probabilities obtained from the POC analysis. A series of network analysis algorithms then selects the maximally altered paths (sequences of directly connected genes). The genes in the set of identified paths are then submitted for further analysis. In our analysis, the list of genes is further analyzed using a standard machine learning algorithm to evaluate the discriminatory power of the genes obtained by POC.
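The path-selection step can be sketched as follows. The scoring rule is an assumption made for illustration: a path's score is taken to be the product of the node and edge probabilities along it, and the search is an exhaustive walk over simple paths of a fixed length. The paper's actual network algorithms, including the game-theoretic component, are not reproduced here. The function `best_path`, the gene names, and all probabilities are hypothetical.

```python
def best_path(node_poc, edge_poc, n_nodes=3):
    """Return (score, path) for the simple directed path of `n_nodes`
    genes with the highest product of node and edge probabilities."""
    # Build an adjacency list from the directed edge dictionary.
    neighbors = {}
    for (u, v), p_edge in edge_poc.items():
        neighbors.setdefault(u, []).append((v, p_edge))

    best = (0.0, None)

    def extend(path, score):
        nonlocal best
        if len(path) == n_nodes:
            if score > best[0]:
                best = (score, tuple(path))
            return
        for nxt, p_edge in neighbors.get(path[-1], []):
            if nxt not in path:  # keep the path simple (no repeated genes)
                path.append(nxt)
                extend(path, score * p_edge * node_poc[nxt])
                path.pop()

    for start, p_node in node_poc.items():
        extend([start], p_node)
    return best


# Toy network: node and edge probabilities are made up for illustration.
node_poc = {"A": 0.9, "B": 0.8, "C": 0.2, "D": 0.95}
edge_poc = {("A", "B"): 0.9, ("B", "D"): 0.8, ("A", "C"): 0.9, ("C", "D"): 0.9}

score, path = best_path(node_poc, edge_poc, n_nodes=3)
print(path, round(score, 5))
```

In this toy network the path through gene C loses despite its stronger edges, because C itself has a low probability of change. This mirrors the idea that selected paths should show coherent change across every gene along them. Exhaustive search is only practical for small networks and short paths.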

RNA-Seq and microarray breast cancer subtyping expression data sets were used to identify DRG between subtypes. A machine-learning algorithm was trained for subtype discrimination using the DRG, and the training yielded a set of biomarkers. The discriminative power of the biomarkers was tested on an unseen data set. The identified biomarkers overlap with genes previously identified as disease-specific, and we were able to classify disease subtypes with 100% and 80% agreement with PAM50 for the microarray and RNA-Seq data sets, respectively.
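Agreement with a reference classifier such as PAM50 is simply the share of samples given the same subtype label by both methods. The sketch below shows that calculation. The helper `percent_agreement` is hypothetical, and the labels are toy values that are not taken from the study.

```python
def percent_agreement(predicted, reference):
    """Percentage of samples whose predicted subtype matches the reference call."""
    if not reference or len(predicted) != len(reference):
        raise ValueError("label lists must be non-empty and of equal length")
    matches = sum(p == r for p, r in zip(predicted, reference))
    return 100.0 * matches / len(reference)


# Toy subtype calls for four hypothetical samples (not data from the study).
reference_calls = ["LumA", "LumB", "Basal", "Her2"]
predicted_calls = ["LumA", "LumA", "Basal", "Her2"]

agreement = percent_agreement(predicted_calls, reference_calls)
print(agreement)
```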

Sundaramurthy G, Eghbalnia HR. (2015) A probabilistic approach for automated discovery of perturbed genes using expression data from microarray or RNA-Seq. Comput Biol Med 67:29-40. [Epub ahead of print].
