Comparing diverse single-cell RNA sequencing (scRNA-seq) datasets generated by different technologies and in different laboratories remains a major challenge. Loma Linda University researchers address the need for guidance in choosing algorithms that lead to accurate biological interpretation of varied data types acquired with different platforms. Using two well-characterized cellular reference samples (breast cancer cells and B cells), captured either separately or in mixtures, the researchers compared different scRNA-seq platforms and several preprocessing, normalization and batch-effect correction methods at multiple centers. Although preprocessing and normalization contributed to variability in gene detection and cell classification, batch-effect correction was by far the most important factor in correctly classifying the cells. Moreover, scRNA-seq dataset characteristics (for example, sample and cellular heterogeneity and the platform used) were critical in determining the optimal bioinformatic method. Nevertheless, reproducibility across centers and platforms was high when appropriate bioinformatic methods were applied. These findings offer practical guidance for optimizing platform and software selection when designing an scRNA-seq study.
Overall study design, scRNA-seq mapping and numbers of genes detected across datasets
a, Schematic overview of the study design. Two reference cell lines (sample A, HCC1395; sample B, HCC1395BL) were used to generate scRNA-seq data across four platforms (10x Genomics, Fluidigm C1 HT, Fluidigm C1 and Takara Bio ICELL8) and four testing sites (LLU, NCI, FDA and TBU). At the LLU and NCI sites (10x), mixed single-cell captures and library constructions were also prepared with either 10% or 5% cancer cells spiked into the B lymphocytes. At the NCI site, single-cell captures and library constructions were also performed with two methanol-fixed cell mixtures (5% cancer cells spiked into B lymphocytes, termed fixed_1 and fixed_2). One set of 10x scRNA libraries from the NCI was also sequenced using a shorter modified sequencing method. Bulk (BK) RNA-seq data were also obtained from these cell lines, each in triplicate. b, Percentages of reads that mapped to the exonic region (blue), mapped to the non-exonic region (orange) or did not map to the human genome (gray), shown for both the breast cancer cell line (sample A) and the B lymphocyte line (sample B) across 14 pairwise datasets. For UMI methods (10x), dark blue indicates the exonic reads with UMIs. c, Median number of genes detected per cell at different sequencing read depths.