UCLA researchers have developed an “all-in-one,” next-generation statistical simulator capable of assimilating a wide range of information to generate realistic synthetic data and provide a benchmarking tool for medical and biological researchers who use advanced technologies to study diseases and potential therapies. Specifically, the new computer-modeling – or “in silico” – system can help researchers evaluate and validate computational methods.
Single-cell RNA sequencing, also called single-cell transcriptomics, is the foundation for analyzing the genetic makeup (genome-wide gene expression levels) of cells. The introduction of additional “omics” technologies offered detail on a range of molecular features, and in recent years spatial transcriptomic technologies have made it possible to profile gene expression levels together with the spatial locations of cell “neighborhoods,” showing precise locations and movements of cells within tissue.
“Thousands of computational methods have been developed to analyze single-cell and spatial omics data for a variety of tasks, making method benchmarking a pressing challenge for method developers and users,” said Jingyi Jessica Li, PhD, a UCLA researcher and professor in statistics, biostatistics, computational medicine, and human genetics. Li is also affiliated with the Gene Regulation research area at the UCLA Jonsson Comprehensive Cancer Center and leads a research group called the Junction of Statistics and Biology.
“Although simulators have evolved and become more powerful, there are numerous limitations. Few can generate realistic single-cell RNA sequencing data from continuous cell trajectories by mimicking real data, and most lack the ability to simulate data of multi-omics and spatial transcriptomics. We introduced the scDesign3, which we believe is the most realistic and versatile simulator to date, to fill the gap between researchers’ benchmarking needs and the limitations of existing tools,” said Li, senior author of a study published May 11 in Nature Biotechnology.
Figure: scDesign3 generates realistic synthetic data of diverse single-cell and spatial omics technologies.
a, An overview of scDesign3’s simulation functionalities: cell states (for example, discrete cell types, continuous trajectories and spatial locations); multiomics modalities (for example, RNA sequencing (RNA-seq), ATAC-seq, CITE-seq and methylation); and experimental designs (for example, batches, conditions, sex and age). ADT, antibody-derived tag. b,c, scDesign3 outperformed existing simulators scGAN, muscat, SPARSim and ZINB-WaVE in simulating scRNA-seq datasets with a single trajectory (b) and bifurcating trajectories (c). Larger mLISI values represent better resemblance between synthetic data and test data. d,e, scDesign3 simulated realistic gene expression patterns in spatial transcriptomics datasets measured by 10x Visium (d) and Slide-seq (e). Large Pearson correlation coefficients (r) represent similar spatial patterns in synthetic data and test data. f, Using paired scRNA-seq data and spatial transcriptomics data (MOB-SC and MOB-SP in Supplementary Table 2) as input, we defined the ‘ground truth’ cell-type proportions at each spot (left), with the cell types including granule cells (GC), periglomerular cells (PGC), mitral/tufted cells (M/TC) and olfactory sensory neurons (OSNs). Each color represents a cell type. With the cell-type proportions, scDesign3 generated synthetic spatial transcriptomics data in which every spot is a mixture of synthetic single cells, given the spot’s cell-type proportions. The four cell-type marker genes exhibit similar spatial expression patterns in real data (right top) and synthetic data (right bottom). Large r values represent similar expression patterns in synthetic data and test data. g, scDesign3 simulated a realistic scATAC-seq dataset at the count level. DC, dendritic cells; DN T, double-negative T cells; mono, monocytes; NK, natural killer cells; pDC, plasmacytoid dendritic cells. h, scDesign3 simulated a realistic sci-ATAC-seq dataset at both the count level (left, Uniform Manifold Approximation and Projection (UMAP) visualizations of real and synthetic cells based on peak counts) and the read level when coupled with scReadSim (right, pseudobulk read coverages). HPCs, hematopoietic progenitor cells. i, scDesign3 simulated realistic CITE-seq data. Three genes’ protein and RNA abundances are shown on the cell UMAP embeddings in test data (top) and synthetic data (bottom). Large r values represent similar expression patterns in synthetic data and test data. j, scDesign3 generated a multiomics (RNA expression + DNA methylation) dataset (right) by learning from two real single-omics datasets with RNA expression or DNA methylation only (left). The synthetic data preserved the linear cell topology.
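The caption uses the Pearson correlation coefficient (r) to judge how closely a gene's expression pattern in synthetic data matches the test data. As a minimal sketch of that comparison, the Python snippet below computes r for one gene measured at the same set of spots in both datasets. The function name `pearson_r` and the expression values are hypothetical and chosen only for illustration; this is not code from the scDesign3 package.

```python
import math


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences of numbers."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length (at least 2)")
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sums of squares around the means
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression of one gene at the same six spots (made-up values)
test_expr = [0.0, 1.0, 3.0, 5.0, 4.0, 1.0]
synthetic_expr = [0.0, 2.0, 3.0, 4.0, 5.0, 1.0]

r = pearson_r(test_expr, synthetic_expr)
print(f"Pearson r between test and synthetic patterns: {r:.3f}")
```

A value of r close to 1 would indicate that the synthetic pattern rises and falls across spots in the same way as the test pattern.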
The UCLA researchers say they believe scDesign3 “offers the first probabilistic model that unifies the generation and inference for single-cell and spatial omics data. Equipped with interpretable parameters and a model likelihood, scDesign3 is beyond a versatile simulator and has unique advantages for generating customized in silico data, which can serve as negative and positive controls for computational analysis, and for assessing the goodness-of-fit of inferred cell clusters, trajectories, and spatial locations in an unsupervised way.” Goodness-of-fit is a measure of how well a statistical model fits a set of observations.
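To illustrate how a model likelihood can score inferred cell clusters without ground-truth labels, here is a minimal Python sketch. It assumes a deliberately simple model (one Poisson mean per cluster for a single gene) and uses the Bayesian information criterion (BIC) to compare two candidate clusterings. The model, the function names and the counts are all assumptions made for illustration; scDesign3's own statistical model is not described in this article and is not reproduced here.

```python
import math


def poisson_loglik(counts, labels):
    """Log-likelihood of counts with one Poisson mean per label group.

    Each group's mean is set to its maximum-likelihood estimate (the group average).
    Returns the log-likelihood and the number of fitted means.
    """
    groups = {}
    for c, g in zip(counts, labels):
        groups.setdefault(g, []).append(c)
    loglik = 0.0
    for values in groups.values():
        mu = sum(values) / len(values)
        if mu == 0:
            # All counts in this group are zero, so each has probability 1
            continue
        for c in values:
            loglik += c * math.log(mu) - mu - math.lgamma(c + 1)
    return loglik, len(groups)


def bic(counts, labels):
    """BIC = k * ln(n) - 2 * log-likelihood; lower means a better fit."""
    loglik, k = poisson_loglik(counts, labels)
    return k * math.log(len(counts)) - 2.0 * loglik


# Hypothetical counts of one gene in ten cells (made-up values)
counts = [0, 1, 0, 2, 1, 9, 11, 10, 8, 12]
labels_one = ["all"] * 10              # candidate 1: a single cluster
labels_two = ["A"] * 5 + ["B"] * 5     # candidate 2: two clusters

print("BIC, one cluster: ", round(bic(counts, labels_one), 2))
print("BIC, two clusters:", round(bic(counts, labels_two), 2))
```

In this toy example the two-cluster labeling yields the lower BIC, so it would be preferred. The same idea, comparing how well competing cluster assignments explain the data under a fitted model, is what makes an unsupervised goodness-of-fit assessment possible.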
According to the authors, the system’s “transparent modeling and interpretable parameters can help users explore, alter, and simulate data. Overall, scDesign3 is a multi-functional suite for benchmarking computational methods and interpreting single-cell and spatial omics data.”
Source – UCLA
Availability – The scDesign3 package is available at https://github.com/SONGDONGYUAN1994/scDesign3. The comprehensive tutorials are available at https://songdongyuan1994.github.io/scDesign3/docs/index.html.