Identifying causal genes is a fundamental problem in genomic medicine. Researchers from the University of British Columbia and MIT have developed a causal inference framework, CoCoA-diff, that prioritizes disease genes in single-cell RNA-seq data by adjusting for confounders without prior knowledge of control variables. The researchers demonstrate that their method substantially improves statistical power, both in simulations and in a real-world analysis of 70k brain cells collected to dissect Alzheimer’s disease. They identify 215 differentially regulated causal genes across various cell types, including highly relevant genes placed in their proper cell type context. Genes found in different cell types are enriched in distinct pathways, which highlights the importance of cell types in understanding multifaceted disease mechanisms.
Counterfactual confounder adjustment for single-cell differential gene expression analysis (CoCoA-diff)
(a) Hierarchical (nested) structure of single-cell gene expression data. A case-control study includes tens of individuals. Each individual (i) contains a heterogeneous mixture of multiple cell types. Single-cell technology measures a thousand genes in each cell (j).
(b) This work addresses a specific causal inference problem in genomics research: prioritizing genes that are causally modulated by disease status, not genes that affect the predisposition to and risk of disease development.
(c) Overview of the CoCoA-diff approach (see Methods for details). Y: gene expression matrix. Y(0): counterfactual data with disease W = 0. Y(1): counterfactual data with disease W = 1. β: Poisson regression coefficient. δ: residual effect. ρ: sequencing depth. μ: shared confounding effect.
(d) Data generation scheme for the simulation studies. We simulate 50 causal and 9,950 non-causal genes with or without disease-causing mechanisms (an edge between W and λ). Wi: disease label assignment for an individual i. Xi: confounding effects for an individual i. λgi: unobserved gene expression for a gene g of an individual i as a function of X and W. Ygj: realization of cell-level gene expression of a gene g with a cell j-specific sequencing depth ρj (stochastically sampled from a Gamma distribution). Here, we simulated five different X variables.
(e) CoCoA-diff accurately estimates shared confounder variables (μg), showing a significantly higher correlation with the true confounding effects on non-causal genes than a pseudo-bulk analysis.
(f) CoCoA-diff accurately estimates disease-causing effects (δg), showing a high correlation with the true differential effects on causal genes.
(g) Illustration of the CoCoA-diff approach using APOE in microglia as an example. HC: healthy control. AD: Alzheimer’s disease. μ: shared confounding effect. δ: residual differential effect. For clearer visualization, we omitted samples (individuals) with zero read counts observed for the APOE gene in the microglial cell type.
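The data generation scheme described for the simulation studies can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' simulation code: the logistic link from X to W, the log-linear form of λ, the Gamma shape and scale, the reduced problem size, and every function and variable name here are hypothetical choices made only to keep the example small and runnable.

```python
import numpy as np

def simulate_counts(n_ind=20, cells_per_ind=30, n_genes=200,
                    n_causal=10, n_covar=5, seed=0):
    """Toy version of the confounded case-control simulation (all settings assumed)."""
    rng = np.random.default_rng(seed)

    # X_i: confounding covariates, one row per individual
    X = rng.normal(size=(n_ind, n_covar))

    # W_i: disease label, made to depend on X so that X confounds W
    # (the logistic link is an assumption of this sketch)
    alpha = rng.normal(size=n_covar)
    W = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ alpha)))

    # Every gene responds to the confounders; only causal genes respond to W
    gamma = rng.normal(scale=0.5, size=(n_genes, n_covar))
    tau = np.zeros(n_genes)
    tau[:n_causal] = rng.normal(size=n_causal)

    # lambda_gi: unobserved individual-level expression as a function of X and W
    lam = np.exp(gamma @ X.T + np.outer(tau, W))   # shape (n_genes, n_ind)

    # rho_j: cell-specific sequencing depth drawn from a Gamma distribution
    ind_of_cell = np.repeat(np.arange(n_ind), cells_per_ind)
    rho = rng.gamma(shape=2.0, scale=0.5, size=ind_of_cell.size)

    # Y_gj: Poisson counts for gene g in cell j of individual i(j)
    Y = rng.poisson(lam[:, ind_of_cell] * rho)     # shape (n_genes, n_cells)
    return Y, W, X, rho, ind_of_cell, tau

Y, W, X, rho, ind_of_cell, tau = simulate_counts()
print(Y.shape, int(W.sum()), int(np.count_nonzero(tau)))
```

Summing the columns of `Y` that share the same entry in `ind_of_cell` gives one aggregate profile per individual, which is the kind of pseudo-bulk summary that CoCoA-diff is compared against in the simulations.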
Availability: The C++ source code of the binary programs used in the simulations and data analysis is available in the following public repository: https://ypark.github.io/mmutil/