Differential expression analysis in single-cell transcriptomics enables the dissection of cell-type-specific responses to perturbations such as disease, trauma, or experimental manipulations. While many statistical methods are available to identify differentially expressed genes, the principles that distinguish these methods and their performance remain unclear.
Researchers at École Polytechnique Fédérale de Lausanne (EPFL) show that the relative performance of these methods is contingent on their ability to account for variation between biological replicates. Methods that ignore this inevitable variation are biased and prone to false discoveries. Indeed, the most widely used methods can discover hundreds of differentially expressed genes in the absence of biological differences. To exemplify these principles, the researchers exposed true and false discoveries of differentially expressed genes in the injured mouse spinal cord.
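The effect of ignoring replicate structure can be illustrated with a small simulation. The sketch below is not one of the methods benchmarked by the researchers: it uses Welch's t statistic as a stand-in test, and the replicate baselines, cell counts, and function names are all hypothetical. No condition effect is added, so the two groups differ only through replicate-to-replicate variation.

```python
import math
import random
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two samples (no p-value is computed here)."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def simulate_replicate(baseline, n_cells, rng):
    """Expression of one gene in the cells of a single replicate."""
    return [baseline + rng.gauss(0.0, 1.0) for _ in range(n_cells)]


rng = random.Random(1)

# Hypothetical per-replicate baselines. No condition effect is added:
# the groups differ only because individual replicates differ.
control_baselines = [4.5, 5.0, 5.8]
treated_baselines = [4.8, 5.5, 5.9]

control = [simulate_replicate(b, 300, rng) for b in control_baselines]
treated = [simulate_replicate(b, 300, rng) for b in treated_baselines]

# Cell-level test: every cell is treated as an independent observation.
t_cells = welch_t(
    [x for rep in control for x in rep],
    [x for rep in treated for x in rep],
)

# Replicate-level test: each replicate is summarized by its mean first.
t_replicates = welch_t(
    [mean(rep) for rep in control],
    [mean(rep) for rep in treated],
)

print(f"cell-level t statistic:      {t_cells:.2f}")
print(f"replicate-level t statistic: {t_replicates:.2f}")
```

Both tests see the same difference in means, but the cell-level test divides it by a much smaller standard error because it counts 900 cells per group rather than three replicates. Under these assumptions, the cell-level statistic is several times larger in magnitude.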
DE analysis of single-cell data must account for biological replicates
a Schematic illustration of the experiment shown in b, in which the aggregation procedure was disabled and pseudobulk DE methods were applied to individual cells. b Left, AUCC of the original fourteen DE methods, plus six pseudobulk methods applied to individual cells, in the eighteen ground-truth datasets. Right, Spearman correlation between ERCC mean expression and –log10 p-value assigned by six pseudobulk DE methods, before and after disabling the aggregation procedure. c Schematic illustration of the experiment shown in d, in which the replicate associated with each cell was shuffled to produce ‘pseudo-replicates.’ d Left, AUCC of the original fourteen DE methods, plus six pseudobulk methods applied to pseudo-replicates, in the eighteen ground-truth datasets. Right, Spearman correlation between ERCC mean expression and –log10 p-value assigned by six pseudobulk DE methods, before and after shuffling replicates to produce pseudo-replicates. e Variance of gene expression in pseudobulks formed from biological replicates and pseudo-replicates in mouse bone marrow mononuclear cells stimulated with poly-I:C. Shuffling the replicate associated with each cell produced a systematic decrease in the variance of gene expression. Right, pie chart shows the proportion of genes with increased or decreased variance in pseudo-replicates, as compared to biological replicates. f Decreases in the variance of gene expression in pseudo-replicates as compared to biological replicates across 46 scRNA-seq datasets. Points show the mean variance in biological replicates; arrowheads show the mean variance in pseudo-replicates. g Left, expression of the gene Txnrd3 in biological replicates (points) and pseudo-replicates (arrowheads) from unstimulated cells and cells stimulated with poly-I:C, with the range of possible pseudo-replicate expression values shown as a density. Right, mean (horizontal line) and variance (shaded area) of Txnrd3 expression in biological replicates (left) and pseudo-replicates (right). P-values were calculated by edgeR-LRT.
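The variance decrease described in panels e and f can be reproduced in a toy setting. The following is a minimal sketch under stated assumptions, not the researchers' pipeline: it simulates a single gene, summarizes each replicate by its mean rather than a summed pseudobulk, and uses hypothetical baselines and a hypothetical helper named `replicate_means`.

```python
import random
from statistics import mean, variance


def replicate_means(values, labels):
    """Mean expression per replicate label (a simple pseudobulk summary)."""
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(label, []).append(value)
    return [mean(group) for group in groups.values()]


rng = random.Random(0)

# Hypothetical baseline expression of one gene in six biological replicates.
baselines = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0]
cells_per_replicate = 200

values, labels = [], []
for replicate, baseline in enumerate(baselines):
    for _ in range(cells_per_replicate):
        values.append(baseline + rng.gauss(0.0, 1.0))
        labels.append(replicate)

# Pseudo-replicates: shuffle which replicate each cell is assigned to,
# keeping the number of cells per replicate unchanged.
shuffled_labels = labels[:]
rng.shuffle(shuffled_labels)

var_biological = variance(replicate_means(values, labels))
var_pseudo = variance(replicate_means(values, shuffled_labels))

print(f"variance across biological replicates: {var_biological:.3f}")
print(f"variance across pseudo-replicates:     {var_pseudo:.3f}")
```

Because each pseudo-replicate draws cells from every biological replicate, its mean falls close to the overall mean, and the between-replicate variance largely disappears. A test that estimates variability from pseudo-replicates would therefore see far less variation than the biological replicates actually contain.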