The rapid spread of single-cell RNA sequencing (scRNA-seq) methods has produced a large variety of experimental and computational pipelines for which best practices have not yet been established. Here, researchers from the Ludwig-Maximilians University use simulations based on five scRNA-seq library protocols in combination with nine realistic differential expression (DE) setups to systematically evaluate three mapping, four imputation, seven normalisation and four DE testing approaches. This results in ~3,000 pipelines and also allows them to assess interactions among pipeline steps. The researchers find that the choices of normalisation and library preparation protocol have the biggest impact on scRNA-seq analyses. Specifically, library preparation determines the ability to detect symmetric expression differences, while normalisation dominates pipeline performance in asymmetric DE setups. Finally, the researchers illustrate the importance of informed choices by showing that a good scRNA-seq pipeline can have the same impact on detecting a biological signal as quadrupling the sample size.
A) The data sets yielding raw count matrices. We use scRNA-seq data sets from Ziegenhain et al. [2] and Zheng et al. [16], representing five popular library preparation protocols. For each data set, we obtain multiple gene count matrices that result from various combinations of alignment methods and annotation schemes (see also Supplementary Figures S1 and S2, and Supplementary Tables S1 and S2). B) The simulation setup. Using powsimR (Vieth et al. [10]) distribution estimates from real count matrices, we simulate the expression of 10,000 genes for two groups with 384 vs. 384, 96 vs. 96 and 50 vs. 200 cells, where 5%, 20% or 60% of genes are DE between groups. The magnitude of expression change for each gene is drawn from a narrow gamma distribution (X ∼ Γ(α = 1, β = 2)), and the directions can be symmetric, asymmetric or completely asymmetric. To introduce slight variation in expression capture, we draw a different size factor for each cell from a narrow normal distribution. C) The analysis pipeline. The simulated data sets are then analysed using combinations of four count matrix preprocessing, seven normalisation and four DE approaches. The evaluation of these pipelines focuses on the outcome of the confusion matrix and its derivatives (TPR, FDR, pAUC, MCC), deviance in library size estimates (RMSE) and computational run time.
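The simulation and evaluation steps in panels B and C can be illustrated with a minimal Python sketch. It is not the powsimR implementation, and several details are assumptions rather than facts from the text: β is read as a rate (so the scale is 1/2), the gamma draw is treated as a log fold change magnitude, "asymmetric" is modelled as an illustrative 75% of DE genes going up, and size factors use an assumed mean of 1 and standard deviation of 0.1. All function names are hypothetical.

```python
import math
import random


def simulate_log_fold_changes(n_genes=10_000, prop_de=0.05, frac_up=0.5, seed=1):
    """Return one log fold change per gene; non-DE genes get 0.0.

    frac_up controls the direction of change (values are illustrative):
    0.5 = symmetric, 0.75 = asymmetric, 1.0 = completely asymmetric.
    """
    rng = random.Random(seed)
    n_de = round(n_genes * prop_de)
    lfc = [0.0] * n_genes
    for gene in rng.sample(range(n_genes), n_de):
        # Assumption: Gamma(alpha=1, beta=2) with beta as a rate,
        # so the scale passed to gammavariate is 1 / 2.
        magnitude = rng.gammavariate(1.0, 0.5)
        sign = 1.0 if rng.random() < frac_up else -1.0
        lfc[gene] = sign * magnitude
    return lfc


def draw_size_factors(n_cells, mean=1.0, sd=0.1, seed=2):
    """Draw one size factor per cell from a narrow normal distribution.

    mean and sd are assumed values; the floor at 1e-3 is added here only
    to keep every factor strictly positive.
    """
    rng = random.Random(seed)
    return [max(rng.gauss(mean, sd), 1e-3) for _ in range(n_cells)]


def confusion_metrics(truth, called):
    """Compute TPR, FDR and MCC from true and called DE status per gene."""
    pairs = list(zip(truth, called))
    tp = sum(1 for t, c in pairs if t and c)
    fp = sum(1 for t, c in pairs if not t and c)
    fn = sum(1 for t, c in pairs if t and not c)
    tn = sum(1 for t, c in pairs if not t and not c)
    tpr = tp / (tp + fn) if (tp + fn) else float("nan")
    fdr = fp / (tp + fp) if (tp + fp) else 0.0
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"TPR": tpr, "FDR": fdr, "MCC": mcc}


if __name__ == "__main__":
    # 20% DE genes, completely asymmetric (all DE genes up-regulated).
    lfc = simulate_log_fold_changes(prop_de=0.20, frac_up=1.0)
    size_factors = draw_size_factors(n_cells=96 + 96)
    truth = [x != 0.0 for x in lfc]
    # Toy stand-in for a DE test: call only genes with a large true effect.
    called = [abs(x) > 0.5 for x in lfc]
    print(len(size_factors), confusion_metrics(truth, called))
```

In the study itself the DE calls come from real testing methods applied to simulated counts; the threshold on the true effect size above is only a placeholder so the metric functions have something to score.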