TWO-SIGMA-G – a new competitive gene set testing framework for scRNA-seq data accounting for inter-gene and cell–cell correlation

Researchers at Harvard T.H. Chan School of Public Health and the University of North Carolina at Chapel Hill propose TWO-SIGMA-G, a competitive gene set test for scRNA-seq data. TWO-SIGMA-G uses a mixed-effects regression model, based on their previously published TWO-SIGMA, to test for differential expression at the gene level. This regression-based model provides flexibility and rigor at the gene level in (1) handling complex experimental designs, (2) accounting for the correlation between biological replicates, and (3) accommodating the distribution of scRNA-seq data to improve statistical inference. Moreover, TWO-SIGMA-G uses a novel approach to adjust for inter-gene correlation (IGC) at the set level to control the set-level false positive rate. Simulations demonstrate that TWO-SIGMA-G preserves type-I error and increases power in the presence of IGC compared with other methods. Application to two datasets identified HIV-associated interferon pathways in xenograft mice and pathways associated with Alzheimer’s disease progression in humans.
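To make the idea of a set-level IGC adjustment concrete, here is a minimal Python sketch of a competitive rank-sum test whose variance is inflated by the average correlation among test-set genes. It uses the normal-theory variance formula known from CAMERA-style rank-sum tests and is an illustration only: the function names, the normal approximation, and the example numbers are assumptions, and the estimator TWO-SIGMA-G actually uses (including how it obtains the correlation) may differ in its details.

```python
import math


def rank_sum_variance(n_test, n_ref, rho=0.0):
    """Variance of the Mann-Whitney U statistic when the n_test test-set
    statistics share pairwise correlation rho and the n_ref reference
    statistics are independent (normal-theory approximation)."""
    core = (
        math.asin(1.0)
        + (n_ref - 1) * math.asin(0.5)
        + (n_test - 1) * (n_ref - 1) * math.asin(rho / 2.0)
        + (n_test - 1) * math.asin((rho + 1.0) / 2.0)
    )
    return core * n_test * n_ref / (2.0 * math.pi)


def competitive_rank_sum_test(test_stats, ref_stats, rho=0.0):
    """Return (z, two-sided p) comparing test-set to reference-set statistics."""
    n_test, n_ref = len(test_stats), len(ref_stats)
    # U counts (test, reference) pairs in which the test-set gene has the
    # larger statistic; ties contribute 0.5.
    u = sum(
        1.0 if t > r else 0.5 if t == r else 0.0
        for t in test_stats
        for r in ref_stats
    )
    mean_u = n_test * n_ref / 2.0
    z = (u - mean_u) / math.sqrt(rank_sum_variance(n_test, n_ref, rho))
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return z, p


# Hypothetical gene-level statistics, for illustration only.
test_stats = [2.1, 1.7, 2.9, 1.2, 2.4]
ref_stats = [0.3, -0.8, 1.1, 0.5, -0.2, 0.9, 1.5, -1.0]
print(competitive_rank_sum_test(test_stats, ref_stats, rho=0.0))  # unadjusted
print(competitive_rank_sum_test(test_stats, ref_stats, rho=0.2))  # IGC-adjusted
```

With rho set to 0 the variance reduces to the usual Wilcoxon value n_test * n_ref * (n_test + n_ref + 1) / 12. Any positive correlation enlarges the variance, so the adjusted p-value is more conservative than the unadjusted one.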

Type-I error performance for CAMERA, MAST, and TWO-SIGMA-G using a reference set size of 30 genes

Each panel varies whether IGC exists between genes in the test set and whether the gene-level model includes random effect terms (CAMERA never includes gene-level random effect terms). Within each panel, both unadjusted and adjusted set-level p-values are plotted (unadjusted p-values are unavailable for MAST). Each boxplot aggregates six settings that vary both the magnitude of the average inter-gene correlation in the test set (where applicable) and the nature of the correlation structure, via the introduction of other individual-level covariates. These settings are intended to represent the diversity seen in real data sets and so give an accurate picture of testing properties over a wide range of gene sets. Each of the six settings in turn comprises 10 replicates that differ only in the random seed, to mimic the impact of a different starting pool of cells from which genes were simulated. See the Methods section of the paper for more details regarding the simulation procedure.
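The type-I error inflation that unadjusted set-level p-values can show under IGC is easy to reproduce in a toy setting. The sketch below is not the authors' simulation, which generates scRNA-seq data and fits gene-level models. It simply draws standard normal gene-level statistics under the null, gives the test-set genes an equal pairwise correlation through a shared factor, and applies a rank-sum test that wrongly assumes independence. The function name, the test set size of 20, and the correlation values are assumptions chosen for illustration; the reference set size of 30 follows the figure.

```python
import math
import random


def null_rejection_rate(n_test, n_ref, rho, n_reps=400, alpha=0.05, seed=1):
    """Fraction of null simulations in which an unadjusted rank-sum test rejects."""
    rng = random.Random(seed)
    mean_u = n_test * n_ref / 2.0
    # Standard deviation of U under the (possibly wrong) independence assumption.
    sd_u = math.sqrt(n_test * n_ref * (n_test + n_ref + 1) / 12.0)
    rejections = 0
    for _ in range(n_reps):
        shared = rng.gauss(0.0, 1.0)  # common factor inducing correlation rho
        test = [
            math.sqrt(rho) * shared + math.sqrt(1.0 - rho) * rng.gauss(0.0, 1.0)
            for _ in range(n_test)
        ]
        ref = [rng.gauss(0.0, 1.0) for _ in range(n_ref)]
        u = sum(1.0 for t in test for r in ref if t > r)
        z = (u - mean_u) / sd_u
        p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
        rejections += p < alpha
    return rejections / n_reps


for rho in (0.0, 0.1, 0.3):
    print(rho, null_rejection_rate(n_test=20, n_ref=30, rho=rho))
```

Every simulated gene has mean zero and unit variance, so any rejection is a false positive. The rejection rate stays near the nominal level only when rho is 0 and grows as the correlation within the test set increases.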

Availability – TWO-SIGMA-G is implemented in the function twosigmag in the twosigma R package, which is freely available on GitHub at https://github.com/edvanburen/twosigma.

Van Buren E, Hu M, Cheng L, Wrobel J, Wilhelmsen K, Su L, Li Y, Wu D. (2022) TWO-SIGMA-G: a new competitive gene set testing framework for scRNA-seq data accounting for inter-gene and cell-cell correlation. Brief Bioinform [Epub ahead of print].
