Single-cell RNA sequencing (scRNA-seq) has enabled researchers to study gene expression at cellular resolution. However, noise due to amplification and dropout may obstruct analyses, so scalable denoising methods are needed for increasingly large but sparse scRNA-seq datasets. Researchers from Helmholtz Zentrum München propose a deep count autoencoder network (DCA) to denoise scRNA-seq datasets. DCA takes the count distribution, overdispersion and sparsity of the data into account using a negative binomial noise model with or without zero-inflation, and it captures nonlinear gene-gene dependencies. The method scales linearly with the number of cells and can therefore be applied to datasets of millions of cells. Using simulated and real datasets, the researchers demonstrate that DCA denoising improves a diverse set of typical scRNA-seq data analyses. DCA outperforms existing methods for data imputation in quality and speed, enhancing biological discovery.
DCA denoises scRNA-seq data by learning the underlying true zero-noise data manifold using an autoencoder framework
a Schematic of the denoising process, adapted from Goodfellow et al. Red arrows illustrate how a corruption process, i.e. measurement noise including dropout events, moves data points xj away from the data manifold (black line). The autoencoder is trained to denoise the data by mapping measurement-corrupted data points back onto the data manifold (green arrows). Filled blue dots represent corrupted data points. Empty blue dots represent the data points without noise. b The autoencoder with a ZINB loss function. The input is the original count matrix (pink rectangle; genes-by-cells matrix, with dark blue indicating zero counts), shown with six genes (pink nodes) for illustration purposes. The blue nodes depict the mean of the negative binomial distribution, which is the main output of the method and represents the denoised data, whereas the green and red nodes represent the other two parameters of the ZINB distribution, namely dispersion and dropout. Note that the output nodes for mean, dispersion and dropout also each consist of six genes, matching the six input genes. The mean matrix of the negative binomial component (blue rectangle) holds the mean values for all cells and represents the denoised output. Input counts, mean, dispersion and dropout probabilities are denoted as x, μ, θ and π, respectively.
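To make the roles of x, μ, θ and π concrete, the following is a minimal sketch of a ZINB negative log-likelihood in plain Python. It is not the DCA implementation. It assumes the common parameterization in which the negative binomial has mean μ and dispersion θ (variance μ + μ²/θ), and in which π is the probability that an observed zero is a dropout. The function names `nb_log_pmf`, `zinb_log_pmf` and `zinb_nll` are hypothetical and chosen for illustration only.

```python
import math


def nb_log_pmf(x, mu, theta):
    """Log-probability of count x under a negative binomial with
    mean mu > 0 and dispersion theta > 0 (assumed parameterization)."""
    return (
        math.lgamma(x + theta) - math.lgamma(theta) - math.lgamma(x + 1)
        + theta * math.log(theta / (theta + mu))
        + x * math.log(mu / (theta + mu))
    )


def zinb_log_pmf(x, mu, theta, pi):
    """Log-probability of count x under a zero-inflated negative binomial.

    pi (0 <= pi < 1) is the dropout probability: with probability pi the
    count is forced to zero, otherwise it is drawn from the NB component.
    """
    nb = nb_log_pmf(x, mu, theta)
    if x == 0:
        # A zero can come from dropout or from the NB component itself.
        return math.log(pi + (1.0 - pi) * math.exp(nb))
    # A non-zero count can only come from the NB component.
    return math.log(1.0 - pi) + nb


def zinb_nll(counts, mus, thetas, pis):
    """Summed negative log-likelihood over paired counts and parameters."""
    return -sum(
        zinb_log_pmf(x, mu, theta, pi)
        for x, mu, theta, pi in zip(counts, mus, thetas, pis)
    )


if __name__ == "__main__":
    # Toy values for four (gene, cell) entries; these are not real data.
    counts = [0, 0, 3, 7]
    mus = [0.5, 2.0, 2.5, 6.0]
    thetas = [1.0, 1.0, 2.0, 2.0]
    pis = [0.4, 0.4, 0.1, 0.1]
    print(zinb_nll(counts, mus, thetas, pis))
```

In an autoencoder of the kind shown in panel b, a loss of this form would be evaluated with μ, θ and π taken from the three output layers, and the fitted μ would be read off as the denoised expression. Setting π to zero recovers the plain negative binomial model, which corresponds to the variant without zero-inflation.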
Availability – The approach is implemented in Python and as a command line tool, publicly available at https://github.com/theislab/dca. Alternatively, Scanpy users can directly use the “dca” method in the preprocessing package (https://scanpy.readthedocs.io/en/latest/api/index.html#imputation).