Integration of millions of transcriptomes using batch-aware triplet neural networks

Efficient integration of heterogeneous and increasingly large single-cell RNA sequencing data poses a major challenge for analysis and, in particular, for comprehensive atlasing efforts. Researchers from The University of Texas Health Science Center at Houston developed a deep learning algorithm called INSCT (Insight) that overcomes batch effects using batch-aware triplet neural networks. Using simulated and real data, they demonstrate that INSCT generates an embedding space that accurately integrates cells across experiments, platforms and species. Benchmark comparisons with current state-of-the-art single-cell RNA sequencing integration methods showed that INSCT outperforms competing methods in scalability while achieving comparable accuracy. Moreover, running INSCT in semisupervised mode lets users classify unlabelled cells by projecting them into a reference collection of annotated cells. To demonstrate scalability, the researchers applied INSCT to integrate more than 2.6 million transcriptomes from four independent studies of mouse brains in less than 1.5 h using less than 25 GB of memory. This makes atlas-scale data integration feasible on a typical desktop computer.
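To make the idea of classifying unlabelled cells against an annotated reference concrete, here is a minimal sketch that assigns each query cell the majority label of its nearest reference cells in an already-integrated embedding. The k-nearest-neighbour vote, the function name `knn_label_transfer` and the toy coordinates are illustrative assumptions; this is not necessarily how INSCT's semisupervised mode is implemented.

```python
import numpy as np
from collections import Counter

def knn_label_transfer(ref_embedding, ref_labels, query_embedding, k=3):
    """Assign each query cell the majority label of its k nearest reference cells.

    Hypothetical helper for illustration only; assumes reference and query
    cells already live in the same integrated embedding space.
    """
    predictions = []
    for cell in query_embedding:
        # Euclidean distance from this query cell to every reference cell
        dists = np.linalg.norm(ref_embedding - cell, axis=1)
        nearest = np.argsort(dists)[:k]
        votes = Counter(ref_labels[i] for i in nearest)
        predictions.append(votes.most_common(1)[0][0])
    return predictions

# Toy 2-D embedding: two well-separated annotated groups (made-up values)
ref_embedding = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                          [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])
ref_labels = ["neuron"] * 3 + ["astrocyte"] * 3
query_embedding = np.array([[0.05, 0.05], [4.9, 5.0]])

predicted = knn_label_transfer(ref_embedding, ref_labels, query_embedding)
print(predicted)
```

The quality of such a label transfer depends entirely on how well the embedding mixes batches while keeping cell types apart, which is what the integration step is meant to provide.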

Overview of INSCT

Fig. 1

a, INSCT learns a data representation that integrates cells across batches. The goal of the network is to minimize the distance between Anchor and Positive while maximizing the distance between Anchor and Negative. Anchor and Positive pairs consist of transcriptomically similar cells from different batches. The Negative is a transcriptomically dissimilar cell sampled from the same batch as the Anchor. b, Principal components of three data points corresponding to Anchor, Positive and Negative are fed into three identical neural networks, which share weights. The triplet loss function is used to train the network weights, and the activations of the two-dimensional embedding layer represent the integrated embedding.
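The caption above can be illustrated with a small sketch of batch-aware triplet selection followed by the standard triplet margin loss. Several details here are simplifying assumptions rather than the published method: the Positive is taken as the single nearest cell from another batch, the Negative is drawn from same-batch cells outside the Anchor's `k` nearest neighbours, distances are Euclidean, and the margin is 1.0. The function names and toy data are hypothetical.

```python
import numpy as np

def sample_triplet(pcs, batches, anchor_idx, k=1, rng=None):
    """Pick (anchor, positive, negative) indices for one anchor cell.

    Simplified, assumed scheme: positive = nearest cell in a different batch;
    negative = random same-batch cell that is not among the k nearest.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    dists = np.linalg.norm(pcs - pcs[anchor_idx], axis=1)
    # Positive: most similar cell from any other batch
    other = np.flatnonzero(batches != batches[anchor_idx])
    positive_idx = other[np.argmin(dists[other])]
    # Negative: same batch as the anchor, excluding its k closest cells
    same = np.flatnonzero(batches == batches[anchor_idx])
    same = same[same != anchor_idx]
    far = same[np.argsort(dists[same])[k:]]
    negative_idx = rng.choice(far)
    return anchor_idx, int(positive_idx), int(negative_idx)

def triplet_loss(anchor, positive, negative, margin=1.0):
    """Triplet margin loss: mean of max(d(a, p) - d(a, n) + margin, 0)."""
    d_pos = np.linalg.norm(anchor - positive, axis=1)
    d_neg = np.linalg.norm(anchor - negative, axis=1)
    return float(np.mean(np.maximum(d_pos - d_neg + margin, 0.0)))

# Toy principal components for six cells in two batches (made-up values)
pcs = np.array([[0.0, 0.0], [0.1, 0.0], [5.0, 5.0],
                [0.0, 0.2], [5.0, 5.1], [10.0, 10.0]])
batches = np.array(["A", "A", "A", "B", "B", "B"])
triplet = sample_triplet(pcs, batches, anchor_idx=0, k=1)

# Loss on toy embeddings: a nearby negative is penalised, a distant one is not
anchor = np.array([[0.0, 0.0]])
positive = np.array([[0.0, 1.0]])
loss_near = triplet_loss(anchor, positive, np.array([[1.0, 0.0]]))
loss_far = triplet_loss(anchor, positive, np.array([[3.0, 4.0]]))
print(triplet, loss_near, loss_far)
```

In a real network the loss would be computed on the outputs of the three weight-sharing networks and minimised by gradient descent; the NumPy version only shows what quantity is being optimised.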

Availability – INSCT is freely available at https://github.com/lkmklsmn/insct.

Simon LM, Wang YY, Zhao Z. (2021) Integration of millions of transcriptomes using batch-aware triplet neural networks. Nat Mach Intell [Epub ahead of print]. [abstract]
