In biological studies, identifying biological features that are significantly different under different experimental conditions or disease states is important to understand the biological mechanisms behind phenotypes — an individual’s, or organism’s, observable traits, such as height, eye color, and blood type.
Among these features, gene expression is the one most commonly studied by scientists. The development of a technique called RNA sequencing, or RNA-seq — which allows biologists to study all genes at a time, revealing the presence and quantity of RNA molecules in a biological sample — has made it easier and faster for scientists to identify differentially expressed genes (DEGs) at the genome-wide level.
However, the generally small sample size of RNA-seq data (usually only 2-4 biological replicates per condition) makes it difficult to identify DEGs accurately. (Previous researchers have developed statistical methods based on parametric distributional assumptions and the empirical Bayes approach to improve statistical power in small samples, such as two popular methods among scientists: DESeq2 and edgeR.) As the cost of sequencing has declined, the sample size of RNA-seq data has gradually increased, reaching hundreds or even thousands, in some population-level studies. This large increase raises the question of whether methods like DESeq2 and edgeR, designed for small-sample-size data, are still applicable to much larger population-level RNA-seq data sets.
Now, researchers from UCLA and UC Irvine have answered this question and published a paper on March 15 titled “Exaggerated false positives by popular differential expression methods when analyzing human population samples” in the journal Genome Biology.
The researchers discovered through permutation analysis that DESeq2 and edgeR have extremely high false discovery rates. False discovery rate (FDR) is a statistical concept that describes the reliability of discoveries identified by a method. The smaller the FDR, the more reliable the discoveries. This concept has been widely used as a criterion in bioinformatics tools for analyzing various biological data, such as genomics and proteomics data.
By further evaluating multiple DEG analysis methods, the researchers found that only the Wilcoxon rank-sum test could control the FDR and achieve good power.
Exaggerated false DEGs identified by DESeq2 and edgeR
from anti-PD-1 therapy RNA-seq datasets
A Barplot showing the average numbers of DEGs (left y-axis) and the proportion of DEGs out of all genes (right y-axis) identified from 1000 permuted datasets. The error bars represent the standard deviations of 1000 permutations. The red dots indicate the numbers of DEGs identified from the original dataset. B The distributions of the number of permuted datasets where a gene was mistakenly identified as a DEG. The percentages corresponding to the numbers are listed in parentheses below the numbers. C Barplot showing the average numbers of DEGs (left y-axis) and the proportion of DEGs out of all genes (right y-axis) identified from both the original dataset and any of the 1000 permuted datasets. The error bars represent the standard deviations of 1000 permutations. The red dots indicate the numbers of DEGs identified from the original dataset. D Percentage of permuted datasets where a DEG identified from the original dataset was also identified as a DEG. The genes are sorted by absolute log2(fold-change) in the original dataset in decreasing order. The absolute log2(fold-change) values corresponding to the ranks are listed in parentheses below the ranks. The line is fitted using the loess method, and the shaded areas represent 95% confidential intervals. E GO term enrichment for the DEGs identified from at least 10% permuted datasets. The top 5 enriched biological processes GO terms are shown. The analyses were performed using R package clusterProfiler. P.adjust represents the adjusted p-value using the Benjamini & Hochberg method. F Violin plots showing the poorness of fitting the negative binomial model to the genes identified by DESeq2 or edgeR as DEGs from ≥ 20% vs. ≤ 0.1% permuted datasets. The poorness of fit for each gene is defined as its negative log10(p-value) from the goodness-of-fit test for the negative binomial distributions estimated by DESeq2 or edgeR. The p-value in each panel was calculated by the Wilcoxon rank-sum test to compare the two groups of genes’ poorness-of-fit values
“Researchers demand reliable discoveries that contain few false discoveries, and for population-level RNA-seq studies, we recommend the Wilcoxon rank-sum test,” said Jingyi “Jessica” Li, the study’s senior author, who is a UCLA associate professor of statistics and is affiliated with UCLA’s Bioinformatics Ph.D. program.
Source – UCLA