Unsupervised deep learning methods have proven useful for noisy single-cell mRNA sequencing (scRNA-seq) data, where the models generalize well despite the zero-inflation of the data. One class of neural networks, autoencoders, has been useful for denoising single-cell data, imputing missing values, and reducing dimensionality.
University of Copenhagen researchers present a striking feature with the potential to greatly increase the usability of autoencoders: with specialized training, the autoencoder is able not only to generalize over the data, but also to tease apart biologically meaningful modules, which the researchers found encoded in the representation layer of the network. From scRNA-seq data, the model can delineate the biologically meaningful modules that govern a dataset and indicate which modules are active in each single cell. Importantly, most of these modules can be explained by known biological functions, as provided by the Hallmark gene sets.
The researchers find that tailored training of an autoencoder makes it possible to deconvolute the biological modules inherent in the data without any prior assumptions. By comparing the modules with gene signatures of canonical pathways, they show that the modules are directly interpretable. The discovery has important implications, as it makes it possible to outline the drivers behind a given effect in a cell. Compared with other dimensionality reduction methods, or with supervised classification models, this approach has two benefits: it handles the zero-inflated nature of scRNA-seq data well, and it validates that the model captures relevant information by establishing a link between the input and the decoded data. Looking ahead, this model, combined with clustering methods, can provide information about which subtype a given single cell belongs to, as well as which biological functions determine that membership.
General overview of the approach
Expression data act as input to the autoencoder (b), which models the data. The model’s representation of the data set can be visualized by a dimensionality reduction plot (c). The impact of gene sets of interest on the representation can be visualized either for the whole data set (d) or for a comparison between two groups of cells (e).
b: A schematic of an autoencoder artificial neural network. The autoencoder shown has an input, a hidden and an output layer, but autoencoders commonly contain more hidden layers. Usually the hidden layer in the middle of the network acts as the representation layer, which contains the compressed information of the original data. The representation is decompressed in the output layer, where the input is recreated with some accuracy.
a & c: Uniform Manifold Approximation and Projection (UMAP) of the Paul et al. data. The UMAP of the original input data is shown in (a), and the UMAP of the representation layer, evaluated after training, is shown in (c). The neighborhood structure of the original input data is retained in the representation layer.
d & e: Heatmaps of the impact of the Hallmark molecular pathways on the representation layer of the autoencoder trained on Paul et al. The impact is computed via saliency maps (see Methods section). To enhance visual clarity, only the high-impact pathways are shown. The impact of the gene signatures is plotted for the whole dataset (d) and for the comparison between two groups of the dataset, CMP CD41 and Cebpe control, which also includes differentiated cells (e). The comparison is made by subtracting the impact of the Hallmark pathways in one group from that in the other. The difference in impact is overlaid on the “general” heatmap (d).
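The ideas in the caption (a representation layer that compresses the input, and a saliency-based impact of gene sets on that layer) can be illustrated with a small sketch. This is not the researchers' implementation. It assumes a single tanh hidden layer, plain gradient descent, synthetic zero-inflated data, and three hypothetical gene sets (SET_A, SET_B, SET_C). It also assumes that "impact" means the mean absolute gradient of each representation node with respect to the genes in a set, which may differ from the aggregation used in the Methods.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic zero-inflated "expression" matrix (toy data, not Paul et al.).
n_cells, n_genes, n_hidden = 200, 30, 4
factors = rng.normal(size=(n_cells, n_hidden))
loadings = rng.normal(size=(n_hidden, n_genes))
X = np.maximum(factors @ loadings, 0.0)        # non-negative, expression-like values
X[rng.random(X.shape) < 0.3] = 0.0             # extra dropout zeros
X = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-8)

# Autoencoder with one hidden (representation) layer: tanh encoder, linear decoder.
W_enc = rng.normal(scale=0.1, size=(n_genes, n_hidden))
W_dec = rng.normal(scale=0.1, size=(n_hidden, n_genes))

def encode(x):
    """Representation layer activations, shape (cells, hidden)."""
    return np.tanh(x @ W_enc)

def reconstruction_error(x):
    """Mean squared error between the input and the decoded output."""
    return float(np.mean((encode(x) @ W_dec - x) ** 2))

initial_error = reconstruction_error(X)

# Full-batch gradient descent on the squared reconstruction error.
learning_rate = 0.01
for _ in range(2000):
    H = encode(X)
    err = H @ W_dec - X                         # (cells, genes)
    grad_dec = H.T @ err / n_cells
    grad_pre = (err @ W_dec.T) * (1.0 - H ** 2) # backprop through tanh
    grad_enc = X.T @ grad_pre / n_cells
    W_dec -= learning_rate * grad_dec
    W_enc -= learning_rate * grad_enc

final_error = reconstruction_error(X)

def saliency(x):
    """|d h_k / d x_j| per cell, shape (cells, genes, hidden)."""
    H = encode(x)
    return np.abs((1.0 - H ** 2)[:, None, :] * W_enc[None, :, :])

# Hypothetical gene sets, given as column indices into X.
gene_sets = {
    "SET_A": np.arange(0, 10),
    "SET_B": np.arange(10, 20),
    "SET_C": np.arange(20, 30),
}

def impact(x):
    """Mean saliency of each gene set on each representation node: (sets, hidden)."""
    per_gene = saliency(x).mean(axis=0)         # average over cells -> (genes, hidden)
    return np.stack([per_gene[idx].mean(axis=0) for idx in gene_sets.values()])

impact_all = impact(X)                          # whole data set, as in panel (d)
impact_diff = impact(X[:100]) - impact(X[100:]) # group comparison, as in panel (e)

print(f"reconstruction error: {initial_error:.3f} -> {final_error:.3f}")
print("impact matrix shape (gene sets x representation nodes):", impact_all.shape)
```

Each row of `impact_all` corresponds to one gene set and each column to one node of the representation layer, so the matrix can be drawn directly as a heatmap. Splitting the cells into the first and last 100 is arbitrary here; in practice the two groups would come from cell annotations or clustering.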