Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at cellular resolution. However, noise from amplification and dropout can obstruct downstream analyses, so scalable denoising methods are needed for increasingly large but sparse scRNA-seq datasets.
Researchers at Helmholtz Zentrum München propose a deep count autoencoder network (DCA) to denoise scRNA-seq datasets. DCA accounts for the count distribution, overdispersion and sparsity of the data using a zero-inflated negative binomial (ZINB) noise model, and captures nonlinear gene-gene and gene-dispersion interactions. The method scales linearly with the number of cells and can therefore be applied to datasets with millions of cells. The researchers demonstrate on simulated and real datasets that DCA denoising improves a diverse set of typical scRNA-seq analyses. DCA outperforms existing data-imputation methods in both quality and speed, enhancing biological discovery.
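To make the noise model concrete, the ZINB likelihood that such an autoencoder minimizes can be sketched in NumPy. This is an illustrative sketch, not DCA's actual implementation (the package builds the loss in a deep-learning framework); the function name `zinb_nll` and the `eps` stabilizer are our own choices:

```python
import numpy as np
from scipy.special import gammaln

def zinb_nll(x, mu, theta, pi, eps=1e-10):
    """Elementwise negative log-likelihood of counts x under a
    zero-inflated negative binomial with mean mu, dispersion theta
    and dropout probability pi. Arguments broadcast, so x can be a
    genes-by-cells count matrix."""
    x, mu, theta, pi = (np.asarray(a, dtype=float) for a in (x, mu, theta, pi))
    # Negative binomial log-pmf in the mean/dispersion parameterization
    log_nb = (gammaln(x + theta) - gammaln(theta) - gammaln(x + 1)
              + theta * np.log(theta / (theta + mu) + eps)
              + x * np.log(mu / (theta + mu) + eps))
    # A zero count can come from dropout (pi) or from the NB itself
    log_zero = np.log(pi + (1.0 - pi) * np.exp(log_nb) + eps)
    log_nonzero = np.log(1.0 - pi + eps) + log_nb
    return -np.where(x == 0, log_zero, log_nonzero)
```

With pi = 0 this reduces to an ordinary negative binomial likelihood, which is one way to sanity-check the parameterization.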
DCA denoises scRNA-seq data by learning the data manifold using an autoencoder framework
Panel A depicts a schematic of the denoising process, adapted from Goodfellow et al.26. Red arrows illustrate how a corruption process, i.e. measurement noise from dropout events, moves data points away from the data manifold (black line). The autoencoder is trained to denoise the data by mapping corrupted data points back onto the manifold (green arrows). Filled blue dots represent corrupted data points; empty blue dots represent the data points without noise. Panel B shows the autoencoder with a ZINB loss function. The input is the original count matrix (pink rectangle; genes-by-cells matrix, with dark blue indicating zero counts), and the mean matrix of the negative binomial component represents the denoised output (blue rectangle). Input counts, mean, dispersion and dropout probabilities are denoted as x, μ, θ and π, respectively.
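One way to see what the dropout parameter π buys: for an observed zero count, the ZINB model implies a posterior probability that the zero is a technical dropout rather than a biological zero. The following sketch is our illustration of that consequence of the model, not part of DCA's API:

```python
def dropout_posterior(mu, theta, pi):
    """For an observed zero count, the posterior probability that it is
    a technical dropout rather than a biological zero under a
    ZINB(mu, theta, pi) model."""
    nb_zero = (theta / (theta + mu)) ** theta  # NB probability of a zero
    return pi / (pi + (1.0 - pi) * nb_zero)

# A zero for a gene expected at mu = 10 is almost surely a dropout
print(round(dropout_posterior(mu=10.0, theta=1.0, pi=0.5), 3))  # → 0.917
```

For lowly expressed genes (small μ), the negative binomial itself produces many zeros, so the posterior stays close to the prior π; for highly expressed genes it approaches 1.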
Availability – DCA, including a usage tutorial, can be downloaded from: https://github.com/theislab/dca