Large, comprehensive collections of single-cell RNA sequencing (scRNA-seq) datasets have been generated that allow for the full transcriptional characterization of cell types across a wide variety of biological and clinical conditions. As new methods arise to measure distinct cellular modalities, a key analytical challenge is to integrate these datasets or transfer knowledge from one to the other to better understand cellular identity and functions.
Researchers from Carnegie Mellon University have developed a simple yet surprisingly effective method, common factor integration and transfer learning (cFIT), for capturing batch effects across experiments, technologies, subjects, and even species. The method models the information shared between datasets with a common factor space while allowing for unique distortions and shifts in genewise expression in each batch. The model parameters are learned under an iterative nonnegative matrix factorization (NMF) framework and then used for synchronized integration of assays from different domains. In addition, the model enables transfer, via a low-rank matrix, from more informative data to allow for precise identification in data of lower quality. Compared with existing approaches, the method imposes weaker assumptions on the cell composition of each individual dataset, yet it is shown to be more reliable in preserving biological variation. The researchers applied cFIT to multiple scRNA-seq datasets of the developing human and mouse brain, varying in technology and developmental stage. The successful integration and transfer uncover transcriptional resemblance across systems. The study helps establish a comprehensive landscape of brain cell-type diversity and provides insights into brain development.
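To make the modeling idea concrete, here is a minimal NumPy sketch of a common factor space with batch-specific genewise scaling and shift. It is an illustration under stated assumptions, not the researchers' implementation: the names `H`, `scale`, `shift`, `simulate_batch`, and `remove_batch_effect` are hypothetical, the data are synthetic and noiseless, and the distortion parameters are taken as known rather than estimated.

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, n_factors = 50, 3

# W: nonnegative gene signatures (genes x factors), shared by every batch.
W = rng.gamma(shape=2.0, scale=1.0, size=(n_genes, n_factors))


def simulate_batch(n_cells):
    """Simulate one noiseless batch under the assumed model."""
    # H: nonnegative factor loadings (cells x factors), specific to the batch.
    H = rng.dirichlet(np.ones(n_factors), size=n_cells)
    # Genewise scaling and shift: the batch-specific technical distortion.
    scale = rng.uniform(0.5, 2.0, size=n_genes)
    shift = rng.uniform(0.0, 1.0, size=n_genes)
    X = (H @ W.T) * scale + shift
    return X, H, scale, shift


def remove_batch_effect(X, scale, shift):
    """Undo the genewise distortion, leaving the shared signal H @ W.T."""
    return (X - shift) / scale


X1, H1, scale1, shift1 = simulate_batch(100)
X2, H2, scale2, shift2 = simulate_batch(80)

# After correction, both batches are expressed in the same space.
integrated = np.vstack([
    remove_batch_effect(X1, scale1, shift1),
    remove_batch_effect(X2, scale2, shift2),
])
shared_signal = np.vstack([H1, H2]) @ W.T
max_abs_error = np.abs(integrated - shared_signal).max()
```

In this toy setting the corrected matrices coincide with the shared signal up to floating-point error. With real data the scaling, shift, and loadings are unknown and have to be estimated, which is the role of the iterative NMF procedure described above.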
cFIT integration and transfer approach overview
(A) cFIT performs integration or transfer among scRNA-seq datasets from different batches, technologies, and species. (B) Data integration takes in two or more datasets from different domains in which some cell-level biological processes are shared. Each dataset is modeled by three components: a low-dimensional latent space of gene-level features (gene expression signatures), W, shared across domains; a domain-specific factor loading characterizing cell composition; and a domain-unique scaling and shift capturing the technical distinctions. (C) The integration algorithm estimates these parameters through iterative NMF. The integrated data are obtained by eliminating the technical distinctions and projecting onto a common subspace, where downstream analyses such as clustering and trajectory inference can be performed. (D) The transfer process takes a reference factor matrix representing the gene-level signature profiles and a target dataset sharing the signature space. (E) The transfer algorithm estimates the target-specific parameters to project the target data onto the same low-dimensional space inferred from the reference data. Cell labels can then be assigned either directly by unsupervised learning in the low-dimensional space or by querying the reference data.
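The transfer step in panels (D) and (E) can be sketched as an alternating least-squares fit in which the reference signature matrix W stays fixed and only target-specific parameters are updated. This is a simplified illustration, not the published algorithm: the function names, the symbols `H`, `scale`, and `shift`, the update rules, and the use of `scipy.optimize.nnls` are all assumptions made for the example, and the data are synthetic.

```python
import numpy as np
from scipy.optimize import nnls


def reconstruct(H, W, scale, shift):
    """Model prediction: shared signal with genewise scaling and shift."""
    return (H @ W.T) * scale + shift


def transfer(X, W, n_iter=20):
    """Fit target-specific H, scale, shift while keeping reference W fixed."""
    n_genes = X.shape[1]
    scale = np.ones(n_genes)
    shift = np.zeros(n_genes)
    for _ in range(n_iter):
        # Step 1: nonnegative least squares for each cell's factor loadings.
        A = W * scale[:, None]
        H = np.vstack([nnls(A, x - shift, maxiter=200)[0] for x in X])
        # Step 2: per-gene linear regression of observed on shared signal.
        S = H @ W.T
        S_centered = S - S.mean(axis=0)
        X_centered = X - X.mean(axis=0)
        denom = np.maximum((S_centered ** 2).sum(axis=0), 1e-12)
        slope = (S_centered * X_centered).sum(axis=0) / denom
        scale = np.maximum(slope, 1e-6)  # keep the scaling positive
        shift = X.mean(axis=0) - scale * S.mean(axis=0)
    return H, scale, shift


def relative_error(X, W, H, scale, shift):
    """Frobenius reconstruction error relative to the norm of X."""
    return np.linalg.norm(X - reconstruct(H, W, scale, shift)) / np.linalg.norm(X)


# Synthetic target data generated from a known reference signature matrix.
rng = np.random.default_rng(1)
W_ref = rng.gamma(shape=2.0, scale=1.0, size=(40, 3))
H_true = rng.dirichlet(np.ones(3), size=60)
X_target = (H_true @ W_ref.T) * rng.uniform(0.5, 2.0, size=40) \
    + rng.uniform(0.0, 1.0, size=40)

err_first = relative_error(X_target, W_ref, *transfer(X_target, W_ref, n_iter=1))
H_hat, scale_hat, shift_hat = transfer(X_target, W_ref, n_iter=20)
err_final = relative_error(X_target, W_ref, H_hat, scale_hat, shift_hat)
```

Each step minimizes the squared reconstruction error over one block of parameters with the other held fixed, so the error does not increase from one iteration to the next. The rows of `H_hat` place target cells in the factor space defined by the reference, where they could be clustered or compared with reference cells. Note that `H_hat` and `scale_hat` are only determined up to a shared rescaling in this sketch.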