scGPT – towards building a foundation model for single-cell multi-omics using generative AI

Generative pre-trained models have achieved remarkable success in domains such as natural language processing and computer vision. In particular, the combination of large-scale, diverse datasets and pre-trained transformers has emerged as a promising approach for developing foundation models. Drawing a parallel between language and cellular biology, where texts are composed of words much as cells are defined by genes, this study probes the applicability of foundation models to cellular biology and genetics research. Leveraging the rapidly growing body of single-cell sequencing data, researchers at the University of Toronto have built scGPT, a foundation model for single-cell biology based on a generative pre-trained transformer trained on a repository of over 33 million cells. Their findings show that scGPT effectively distills critical biological insights about genes and cells. Through transfer learning, scGPT can be fine-tuned to achieve superior performance on diverse downstream applications, including cell-type annotation, multi-batch integration, multi-omic integration, genetic perturbation prediction, and gene network inference.

Model Schematic

(A) The workflow of scGPT. The model is generatively pre-trained on large-scale scRNA-seq data from cell atlases. For downstream applications, the pre-trained model parameters can be fine-tuned on new data. The core of scGPT consists of stacked transformer blocks with specialized attention masks for generative training. The authors applied scGPT to a variety of tasks, including cell-type annotation, batch correction, multi-omic integration, genetic perturbation prediction, and gene network inference. (B) A detailed view of the input data embeddings. The input contains three layers of information: gene tokens, expression values, and condition tokens (modality, batch, perturbation conditions, etc.). (C) A detailed view of the scGPT transformer layer. A specially designed attention mask in the Masked Multi-Head Attention block enables generative pre-training on single-cell sequencing data. (D) A diagram illustrating the size of the training data and the organs of origin. The scGPT whole-human model was pre-trained on scRNA-seq data from 33 million normal human cells. (E) UMAP visualization of the pre-trained scGPT cell embeddings (a random 10% subset), colored by major cell types.
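The three-layer input embedding described in panel (B) can be illustrated with a minimal sketch: each gene token, each binned expression value, and each condition token is looked up in its own embedding table, and the three vectors are summed per position before entering the transformer. This is a simplified illustration, not the actual scGPT implementation; all sizes, table names, and the use of plain numpy (rather than learned embeddings in a deep-learning framework) are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration
n_genes, n_bins, n_conditions, d_model = 1000, 51, 4, 64

# Three lookup tables (randomly initialized here; learned in a real model)
gene_emb = rng.normal(size=(n_genes, d_model))       # gene tokens
value_emb = rng.normal(size=(n_bins, d_model))       # binned expression values
cond_emb = rng.normal(size=(n_conditions, d_model))  # e.g. batch / modality

def embed_cell(gene_ids, binned_values, cond_id):
    """Sum the three embedding layers element-wise, one row per gene token."""
    return gene_emb[gene_ids] + value_emb[binned_values] + cond_emb[cond_id]

# A toy cell expressing five genes under condition 1
tokens = embed_cell(np.array([3, 10, 42, 7, 99]),
                    np.array([0, 5, 50, 12, 3]),
                    cond_id=1)
print(tokens.shape)  # (5, 64): one d_model-dimensional vector per gene token
```

The resulting (num_genes, d_model) matrix is the per-cell input sequence that the stacked transformer blocks in panel (C) would consume.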

Availability – The scGPT codebase is publicly available at

Cui H, Wang C, Maan H, Pang K, Luo F, Wang B. (2023) scGPT: Towards Building a Foundation Model for Single-Cell Multi-omics Using Generative AI. bioRxiv [online preprint]. [article]

One comment

  1. Outstanding achievements by Bowang Lab are commendable. Previously, the task of cell identification and annotation presented a significant challenge, yet the introduction of scGPT has streamlined the process with marked improvements in precision. It is noted with regret that complimentary cloud services typically restrict data allowances to 100MB, which stands in stark contrast to the multi-gigabyte size of numerous scRNA-seq h5ad files. The provision of expanded resources would be exceptionally advantageous for the meticulous process of cell annotation.
