Modeling bifurcations in single-cell transcriptomics data has become an increasingly popular field of research. Several methods have been proposed to infer bifurcation structure from such data, but all rely on heuristic, non-probabilistic inference. Here University of Oxford researchers propose the first generative, fully probabilistic model for such inference, based on a Bayesian hierarchical mixture of factor analyzers. Their model exhibits competitive performance on large datasets despite implementing full Markov chain Monte Carlo sampling, and its unique hierarchical prior structure enables automatic determination of the genes driving the bifurcation process. The researchers additionally propose an empirical-Bayes-like extension that deals with the high levels of zero-inflation in single-cell RNA-seq data, and they quantify when such models are useful. They apply their model to both real and simulated single-cell gene expression data and compare the results to existing pseudotime methods. Finally, they discuss both the merits and weaknesses of such a unified, probabilistic approach in the context of practical bioinformatics analyses.
Multiple solutions to bifurcation inference
Starting with three cell states, we would like to infer a bifurcation process from one state to the other two. If a single gene is up-regulated in one of the states, yet down-regulated in the other two, then clearly any state may act as the beginning of the trajectory. For example, if we start in state 1 then the gene is up-regulated along the branch to state 2 and stays constant along the branch to state 3; if we start in state 2 then the gene is down-regulated in states 1 and 3; if we start in state 3 then the gene is up-regulated in state 2 and remains down-regulated in state 1. This non-identifiability persists even if we add further genes that are up-regulated in one or two of the cell states. The equivalent geometric argument is that we can build the transcriptomic profiles across all genes by spinning the figure about B (with possible inversion) and “adding” that gene. No matter how many additional genes we add, any one of the three states can act as the root state, or beginning of pseudotime. Therefore, in the absence of any additional information, there are always three equally valid solutions to bifurcation inference from gene expression data alone.
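The single-gene example above can be sketched in a few lines of Python. This is only an illustration of the argument, not the researchers' model: the function name `describe_branches`, the state labels 1 to 3, and the toy expression values are all assumptions made for this sketch. For each candidate root state it reports how every gene changes along the two branches leading to the other states, showing that each choice of root yields a self-consistent description.

```python
def describe_branches(means, root, tol=1e-9):
    """Describe per-gene changes from a chosen root state to the other states.

    means: dict mapping state label -> list of mean expression values per gene
           (hypothetical toy input, not real data).
    root:  the state label treated as the start of pseudotime.
    Returns a dict mapping each non-root state to a list of
    "up" / "down" / "constant" labels, one per gene.
    """
    branches = {}
    for leaf in means:
        if leaf == root:
            continue
        changes = []
        for root_val, leaf_val in zip(means[root], means[leaf]):
            diff = leaf_val - root_val
            if diff > tol:
                changes.append("up")
            elif diff < -tol:
                changes.append("down")
            else:
                changes.append("constant")
        branches[leaf] = changes
    return branches


# One gene, high in state 2 and low in states 1 and 3 (assumed toy values).
means = {1: [0.0], 2: [1.0], 3: [0.0]}

# Every state can serve as the root: each gives a valid two-branch description.
solutions = {root: describe_branches(means, root) for root in means}
for root, branches in solutions.items():
    print(f"root = state {root}: {branches}")
```

Adding more genes to each list in `means` changes the labels on the branches but never removes any of the three candidate roots, which is the point of the argument: expression data alone cannot pick out the root state.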