Transcriptome sequencing (RNA-seq) is gradually replacing microarrays for high-throughput studies of gene expression. The main challenge of analyzing microarray data is not in finding differentially expressed genes, but in gaining insights into the biological processes underlying phenotypic differences. To interpret experimental results from microarrays, gene set analysis (GSA) has become the method of choice, in particular because it incorporates pre-existing biological knowledge (in the form of functionally related gene sets) into the analysis.
Schematic overview illustrating the breakdown of GSA methods by the null hypothesis they test: methods that can be adapted from microarray practice to fit RNA-seq data (boxes with dots) and methods specifically designed for RNA-seq (boxes with diagonal stripes).
Here, researchers from the University of Arkansas for Medical Sciences provide a brief review of several statistically different GSA approaches (competitive and self-contained), covering both those that can be adapted from microarray practice and those specifically designed for RNA-seq. They evaluate the performance of these approaches (in terms of Type I error rate, power, robustness to sample size and heterogeneity, and sensitivity to different types of selection bias) on simulated and real RNA-seq data. Not surprisingly, the performance of a GSA approach depends only on the statistical hypothesis it tests, not on whether the test was developed for microarray or RNA-seq data. Interestingly, they found that competitive methods have lower power and lower robustness to sample heterogeneity than self-contained methods, which leads to poor reproducibility of results. The researchers also found that the power of unsupervised competitive methods depends on the balance between up- and down-regulated genes in the tested gene sets. These properties of competitive methods had previously been overlooked. The evaluation provides a concise guideline for selecting the GSA approaches that perform best under particular experimental settings in the context of RNA-seq.
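The difference between the two hypothesis classes can be made concrete with a minimal sketch. It is not the code or any specific method evaluated in the review; it only illustrates the general idea, assuming a genes-by-samples matrix of normalized, log-scale expression values and two phenotype groups. A self-contained test asks whether any gene in the set is associated with the phenotype and builds its null distribution by permuting sample labels. A competitive test asks whether genes in the set are more strongly associated than genes outside it and builds its null by drawing random gene sets of the same size. The function names, the mean absolute t-statistic used as the set score, and the simulated data are all hypothetical choices made for illustration.

```python
import numpy as np

def gene_scores(expr, labels):
    """Absolute Welch-type t-statistic per gene (expr is genes x samples)."""
    a, b = expr[:, labels == 0], expr[:, labels == 1]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return np.abs(a.mean(axis=1) - b.mean(axis=1)) / se

def self_contained_p(expr, labels, set_idx, n_perm=500, seed=0):
    """Null: no gene in the set is differentially expressed.
    The null distribution comes from permuting sample labels."""
    rng = np.random.default_rng(seed)
    sub = expr[set_idx]
    observed = gene_scores(sub, labels).mean()
    null = [gene_scores(sub, rng.permutation(labels)).mean()
            for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

def competitive_p(expr, labels, set_idx, n_perm=500, seed=0):
    """Null: genes in the set are no more differentially expressed than
    the rest. The null distribution comes from random gene sets."""
    rng = np.random.default_rng(seed)
    scores = gene_scores(expr, labels)
    observed = scores[set_idx].mean()
    null = [scores[rng.choice(len(scores), size=len(set_idx),
                              replace=False)].mean()
            for _ in range(n_perm)]
    return (1 + sum(s >= observed for s in null)) / (1 + n_perm)

# Simulated example (assumed data): 200 genes, 10 samples per group,
# and a set of 20 genes shifted upward in the second group.
rng = np.random.default_rng(42)
labels = np.array([0] * 10 + [1] * 10)
expr = rng.normal(size=(200, 20))
set_idx = np.arange(20)
expr[np.ix_(set_idx, np.where(labels == 1)[0])] += 2.0

p_self = self_contained_p(expr, labels, set_idx)
p_comp = competitive_p(expr, labels, set_idx)
print("self-contained p:", p_self)
print("competitive p:", p_comp)
```

In this sketch the competitive p-value depends on how the set compares with the background genes, so the same set can look less significant when many other genes are also differentially expressed. Random gene sampling also ignores correlation between genes, which is one reason practical competitive methods differ from this simplified version.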
Boxplots comparing (A) the number of genes per gene set (gene set size), (B) the proportion of DE genes per gene set and (C) the average gene length per gene set for the C2 gene sets detected by different GSA approaches (among 3890 C2 gene sets, α = 0.05).