With the increasing availability of multi-dimensional biological datasets for the same samples (e.g., gene expression, microRNA, copy number, mutation, and methylation data), it has become possible to systematically study the regulatory mechanisms operating in a cancer cell. For this task, it is important to discover sets of co-expressed genes with shared functions, so-called functional gene modules, because co-expressed genes tend to be co-regulated by the same regulators, including transcription factors, microRNAs, and copy number aberrations. Several algorithms have been used to identify such gene modules, including hierarchical clustering and non-negative matrix factorization. Although these algorithms have been applied to many microarray datasets, only a few systematic analyses of them have been performed for RNA-sequencing (RNA-Seq) data to date. Gene expression levels measured by microarray and RNA-Seq tend to be highly correlated, but the expression levels of some genes differ depending on the platform used, which may result in different gene modules being constructed for the same samples.
Here, researchers from Gwangju Institute of Science and Technology compare several module detection algorithms applied to both microarray and RNA-Seq datasets. They further propose a new functional gene module detection algorithm (FGMD), which is based on a hierarchical clustering algorithm modified to reflect actual biological observations, including the fact that a single gene may be involved in multiple biological pathways. When existing algorithms and the new FGMD algorithm were applied to breast cancer and ovarian cancer datasets from The Cancer Genome Atlas, FGMD performed best in most of the functional pathway enrichment tests and in the transcription factor enrichment test. The researchers expect that the FGMD algorithm will improve the identification of functional gene modules related to cancer.
Overview of the approach
(A) Collect gene expression data obtained from microarray and RNA-Seq platforms for paired samples, and calculate the log2 ratios between tumor samples and the average of normal samples. (B) Compare gene expression data of the microarray and RNA-Seq datasets. (C) Construct FGMD modules using the microarray and RNA-Seq gene expression data. (D) Compare the modules constructed by FGMD to those constructed by other methods. PCC, Pearson correlation coefficient; TF, transcription factor.
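Steps (A) and (B) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the pseudocount, and the toy matrices are assumptions made for illustration, and the inputs are assumed to be non-negative, linear-scale expression values arranged as genes by samples, with tumor columns paired across the two platforms.

```python
import numpy as np

def log2_ratios(tumor, normal, pseudocount=1.0):
    """Log2 ratio of each tumor sample to the mean of the normal samples.

    tumor, normal: genes x samples arrays of linear-scale expression.
    pseudocount is an assumed stabilizer to avoid log2(0); it is not from the source.
    """
    tumor = np.asarray(tumor, dtype=float)
    normal = np.asarray(normal, dtype=float)
    normal_mean = normal.mean(axis=1, keepdims=True)
    return np.log2((tumor + pseudocount) / (normal_mean + pseudocount))

def per_gene_pcc(a, b):
    """Pearson correlation coefficient per gene (row) between two
    genes x samples matrices whose columns are the same paired samples."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean(axis=1, keepdims=True)
    b_c = b - b.mean(axis=1, keepdims=True)
    num = (a_c * b_c).sum(axis=1)
    den = np.sqrt((a_c ** 2).sum(axis=1) * (b_c ** 2).sum(axis=1))
    # Genes with zero variance on either platform get NaN instead of a division error.
    return np.divide(num, den, out=np.full_like(num, np.nan), where=den > 0)

# Toy data (hypothetical): 3 genes, 4 paired tumor samples, 2 normal samples per platform.
rng = np.random.default_rng(0)
array_tumor = rng.uniform(50, 500, size=(3, 4))
array_normal = rng.uniform(50, 500, size=(3, 2))
seq_tumor = array_tumor * rng.uniform(0.8, 1.2, size=(3, 4))
seq_normal = array_normal * rng.uniform(0.8, 1.2, size=(3, 2))

array_lr = log2_ratios(array_tumor, array_normal)
seq_lr = log2_ratios(seq_tumor, seq_normal)
pcc = per_gene_pcc(array_lr, seq_lr)  # one PCC per gene across platforms
```

Genes with a low cross-platform PCC in such a comparison are the ones most likely to end up in different modules depending on the platform. Microarray intensities are often already log-transformed, in which case the ratio would instead be a difference of log values.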