As the cost of single-cell RNA-seq experiments has decreased, an increasing number of datasets are now available. Combining newly generated and publicly accessible datasets is challenging due to non-biological signals, commonly known as batch effects. Although there are several computational methods available that can remove batch effects, evaluating which method performs best is not straightforward.
Researchers from the Wellcome Sanger Institute have developed BatchBench, a modular and flexible pipeline for comparing batch correction methods for single-cell RNA-seq data. The researchers apply BatchBench to eight methods, highlight their methodological differences, and assess their performance and computational requirements on a compendium of well-studied datasets. This systematic comparison guides users in the choice of batch correction tool, and the pipeline makes it easy to evaluate other datasets.
Overview of the BatchBench pipeline workflow and schematic representation of the conventional scRNA-seq data analysis pipeline from the expression matrix
(A) BatchBench first carries out QC on the input dataset before performing batch correction with the eight selected methods. A series of downstream analyses is then computed, including UMAP coordinates, Shannon entropies, clustering and marker gene analysis, and resource consumption metrics for each process. (B) The central and lower panels depict the conventional scRNA-seq data analysis pipeline and the analyses that can be carried out with the output of each step. The upper panel represents the space in which each batch correction method operates. The initial expression matrix typically undergoes feature selection and then serves as the source for gene-based analyses, such as marker gene analysis, pseudotime analysis or gene networks. mnnCorrect, Limma, ComBat, Seurat and Scanorama operate in the expression matrix space. Next, a dimensionality reduction step is performed; Harmony and fastMNN operate in this low-dimensional space. The low-dimensional embedding is then converted into a matrix of cell-cell distances, which in turn can be converted to a graph. These are inputs for cell-based analyses such as clustering, visualization and trajectory inference. BBKNN operates in this graph space.
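To make the Shannon entropy step in panel (A) more concrete, the following is a minimal Python sketch of one common way such a mixing score can be computed: for each cell, take its k nearest neighbours in a corrected embedding and measure the entropy of the batch (or cell type) labels among them. This is an illustration under stated assumptions, not BatchBench's own code; the function name `neighborhood_entropy`, the brute-force neighbour search, the default `k`, and the normalisation to the range 0 to 1 are all choices made here for clarity.

```python
import numpy as np


def neighborhood_entropy(embedding, labels, k=5):
    """Per-cell Shannon entropy of `labels` among each cell's k nearest neighbours.

    Hypothetical helper for illustration only (not part of BatchBench).
    embedding: (n_cells, n_dims) array, e.g. a batch-corrected low-dimensional space.
    labels:    (n_cells,) array of batch or cell type labels.
    Returns entropies scaled to [0, 1] by dividing by log2(number of categories).
    """
    embedding = np.asarray(embedding, dtype=float)
    labels = np.asarray(labels)
    n_cells = embedding.shape[0]

    # Brute-force pairwise squared Euclidean distances (fine for small examples).
    diffs = embedding[:, None, :] - embedding[None, :, :]
    dist2 = (diffs ** 2).sum(axis=-1)
    np.fill_diagonal(dist2, np.inf)  # a cell is not its own neighbour

    # Indices of the k closest cells for every cell.
    neighbors = np.argsort(dist2, axis=1)[:, :k]

    categories = np.unique(labels)
    entropies = np.zeros(n_cells)
    for i in range(n_cells):
        neigh_labels = labels[neighbors[i]]
        counts = np.array([(neigh_labels == c).sum() for c in categories])
        probs = counts[counts > 0] / k  # drop empty categories to avoid log(0)
        entropies[i] = -(probs * np.log2(probs)).sum()

    # Normalise so that 1 means labels are evenly mixed in the neighbourhood.
    if len(categories) > 1:
        entropies /= np.log2(len(categories))
    return entropies


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    cells = rng.normal(size=(200, 2))       # toy embedding
    batches = rng.integers(0, 2, size=200)  # toy batch labels, randomly mixed
    print(neighborhood_entropy(cells, batches, k=10).mean())
```

Under this kind of score, a high entropy for batch labels suggests that batches are well mixed after correction, while a low entropy for cell type labels suggests that distinct cell populations have remained separate.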
Availability – https://github.com/cellgeni/batchbench