Integration of single-cell RNA sequencing (scRNA-seq) data from multiple experiments, laboratories and technologies can uncover biological insights, but current methods for scRNA-seq data integration are limited by a requirement for datasets to derive from functionally similar cells. MIT researchers present Scanorama, an algorithm that identifies and merges the shared cell types among all pairs of datasets and accurately integrates heterogeneous collections of scRNA-seq data. They applied Scanorama to integrate and remove batch effects across 105,476 cells from 26 diverse scRNA-seq experiments representing 9 different technologies. Scanorama is sensitive to subtle temporal changes within the same cell lineage, successfully integrating functionally similar cells across time series data of CD14+ monocytes at different stages of differentiation into macrophages. Finally, the researchers show that Scanorama is orders of magnitude faster than existing techniques and can integrate a collection of 1,095,538 cells in just ~9 h.
Illustration of ‘panoramic’ dataset integration
a, A panorama stitching algorithm finds and merges overlapping images to create a larger, combined image. b, A similar strategy can also be used to merge heterogeneous scRNA-seq datasets. Scanorama searches nearest neighbors to identify shared cell types among all pairs of datasets. Dimensionality reduction techniques and an approximate nearest-neighbors algorithm based on hyperplane locality-sensitive hashing and random projection trees greatly accelerate the search step. Mutually linked cells form matches that can be leveraged to correct for batch effects and merge experiments together, whereby the datasets forming connected components on the basis of these matches become an scRNA-seq ‘panorama’.
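The matching step described in the caption can be illustrated with a small toy sketch. This is not Scanorama's implementation. It swaps the approximate nearest-neighbors search for a brute-force Euclidean search, skips dimensionality reduction and batch correction, and runs on made-up two-dimensional points rather than expression data. The function names and the `min_fraction` threshold for linking two datasets are hypothetical choices made for this illustration.

```python
import numpy as np


def nearest_neighbors(query, reference, k):
    """Indices of the k nearest rows of `reference` for each row of `query`.

    Brute-force Euclidean search; a stand-in for the approximate search
    mentioned in the caption.
    """
    diff = query[:, None, :] - reference[None, :, :]
    squared_dist = (diff ** 2).sum(axis=2)
    return np.argsort(squared_dist, axis=1)[:, :k]


def mutual_matches(a, b, k=1):
    """Pairs (i, j) where cell i of `a` and cell j of `b` pick each other."""
    a_to_b = nearest_neighbors(a, b, k)
    b_to_a = nearest_neighbors(b, a, k)
    forward = {(i, int(j)) for i in range(len(a)) for j in a_to_b[i]}
    backward = {(int(i), j) for j in range(len(b)) for i in b_to_a[j]}
    return forward & backward


def panoramas(datasets, k=1, min_fraction=0.5):
    """Group dataset indices into connected components ('panoramas').

    Two datasets are linked when at least `min_fraction` of the cells in
    either one take part in a mutual match (a hypothetical rule for this
    sketch). Components are tracked with a small union-find structure.
    """
    parent = list(range(len(datasets)))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    for i in range(len(datasets)):
        for j in range(i + 1, len(datasets)):
            matches = mutual_matches(datasets[i], datasets[j], k)
            frac_i = len({m[0] for m in matches}) / len(datasets[i])
            frac_j = len({m[1] for m in matches}) / len(datasets[j])
            if max(frac_i, frac_j) >= min_fraction:
                parent[find(i)] = find(j)

    groups = {}
    for idx in range(len(datasets)):
        groups.setdefault(find(idx), []).append(idx)
    return sorted(groups.values())


# Toy data: A and B are the same 5 x 5 grid up to a small shift (a shared
# "cell type"), while C sits far away and overlaps neither of them.
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
dataset_a = grid
dataset_b = grid + 0.1
dataset_c = grid + np.array([100.0, 0.0])

print(panoramas([dataset_a, dataset_b, dataset_c]))  # [[0, 1], [2]]
```

In this toy case every cell in A is mutually matched with its shifted copy in B, so the two datasets join one panorama. Only the facing edges of A and C match each other (5 of 25 cells), which falls below the illustrative threshold, so C remains on its own.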