Constructing gene coexpression networks is a powerful approach for analyzing high-throughput gene expression data towards module identification, gene function prediction, and disease-gene prioritization. While optimal workflows for constructing coexpression networks, including good choices for data pre-processing, normalization, and network transformation, have been developed for microarray-based expression data, such well-tested choices do not exist for RNA-seq data. Almost all studies that compare data processing and normalization methods for RNA-seq focus on the end goal of determining differential gene expression.
Researchers from Michigan State University present a comprehensive benchmarking and analysis of 36 different workflows, each with a unique set of normalization and network transformation methods, for constructing coexpression networks from RNA-seq datasets. The researchers test these workflows on both large, homogeneous datasets and small, heterogeneous datasets from various labs. They analyze the workflows in terms of aggregate performance, individual method choices, and the impact of multiple experimental factors of the datasets. Their results demonstrate that between-sample normalization has the biggest impact, with counts adjusted by size factors producing networks that most accurately recapitulate known tissue-naive and tissue-aware gene functional relationships.
Pipeline for benchmarking the optimal workflow for constructing coexpression networks from RNA-seq data
The main pipeline was executed for the original GTEx and SRA datasets and a large collection of datasets of different sizes resampled from the GTEx datasets. Three key stages (within-sample normalization, between-sample normalization, and network transformation), where method choices were tested, are highlighted in different colors. All the other stages were composed of standard selection, filtering, and data transformation operations. The coexpression networks resulting from all the workflows were evaluated using two gold standards that capture generic (tissue-naive) and tissue-aware gene functional relationships. Finally, all the evaluation results were used to analyze the impact of various aspects of the workflows, methods, and datasets on the accuracy of coexpression networks. Abbreviations: CPM (counts per million), RPKM (reads per kilobase per million mapped reads), TPM (transcripts per million), QNT (quantile), TMM (trimmed mean of M values), UQ (upper quartile), CTF (counts adjusted with TMM factors), CUF (counts adjusted with upper quartile factors), CLR (context likelihood of relatedness), and WTO (weighted topological overlap).
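To make the three stages concrete, here is a minimal Python sketch of one possible workflow on simulated counts. It is an illustration under stated assumptions, not the researchers' code. The function names are hypothetical, and the upper-quartile factors are a simplified stand-in for what a package such as edgeR computes. Dividing raw counts by those factors is assumed here to be what "counts adjusted with upper quartile factors" (CUF) means. The log transform and Pearson correlation are placeholder choices for the "standard" steps the caption mentions.

```python
import numpy as np

def cpm(counts):
    """Within-sample normalization: counts per million (genes x samples)."""
    return counts / counts.sum(axis=0, keepdims=True) * 1e6

def tpm(counts, lengths_kb):
    """Within-sample normalization: transcripts per million."""
    rate = counts / lengths_kb[:, None]          # reads per kilobase
    return rate / rate.sum(axis=0, keepdims=True) * 1e6

def upper_quartile_factors(counts):
    """Simplified per-sample upper-quartile factors (geometric mean of 1)."""
    lib_size = counts.sum(axis=0)
    expressed = counts[counts.sum(axis=1) > 0]   # drop all-zero genes
    uq = np.percentile(expressed, 75, axis=0) / lib_size
    return uq / np.exp(np.mean(np.log(uq)))

def clr(sim):
    """Network transformation: context likelihood of relatedness."""
    s = sim.astype(float)
    np.fill_diagonal(s, np.nan)                  # ignore self-similarity
    mu = np.nanmean(s, axis=1, keepdims=True)
    sd = np.nanstd(s, axis=1, keepdims=True)
    z = np.maximum(0.0, (s - mu) / sd)           # z-score within each gene's row
    out = np.sqrt(z**2 + z.T**2)                 # combine both genes' contexts
    np.fill_diagonal(out, 0.0)
    return out

# Simulated data (hypothetical): 50 genes x 12 samples with varying depth.
rng = np.random.default_rng(0)
base = rng.uniform(5, 500, size=(50, 1))
depth = rng.uniform(0.5, 2.0, size=(1, 12))
counts = rng.poisson(base * depth)
lengths_kb = rng.uniform(0.5, 5.0, size=50)

# Within-sample options, shown for comparison only.
cpm_values = cpm(counts)
tpm_values = tpm(counts, lengths_kb)

# Between-sample stage (assumed CUF): raw counts divided by size factors.
factors = upper_quartile_factors(counts)
cuf = counts / factors

# Assumed transformation and similarity measure, then CLR.
logged = np.log1p(cuf)
correlation = np.corrcoef(logged)                # genes x genes
network = clr(correlation)
```

Swapping `upper_quartile_factors` for TMM factors from an established implementation would give the CTF variant. The rest of the sketch would stay the same.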