Single-cell messenger RNA sequencing (scRNA-seq) has emerged as a powerful tool to study cellular heterogeneity within complex tissues. Subpopulations of cells with common gene expression profiles can be identified by applying unsupervised clustering algorithms. However, technical variance is a major confounding factor in scRNA-seq, not least because it is not possible to replicate measurements on the same cell.
University of Oxford researchers have developed BEARscc, a tool that uses RNA spike-in controls to simulate experiment-specific technical replicates. BEARscc works with a wide range of existing clustering algorithms to assess the robustness of clusters to technical variation. The researchers demonstrate that the tool improves the unsupervised classification of cells and facilitates the biological interpretation of single-cell RNA-seq experiments.
Overview of the BEARscc algorithm
Step 1, estimating the noise model: the variance of gene expression expected in a replicate experiment is estimated from the variation of spike-in measurements. Top: variation in spike-in read counts corresponds well with experimentally observed variability in biological transcripts (for details of the control experiment, see Methods) and with read counts simulated by BEARscc. Bottom: drop-out likelihood is modelled separately, based on the drop-out rate for spike-ins of a given concentration. Shown is the average percentage drop-out rate as a function of the number of transcripts per sample, for spike-ins, simulated replicates and experimental observations in a control experiment (see Methods).
Step 2, simulating technical replicates: the observed gene counts (top matrix) are transformed into multiple simulated technical replicates (bottom) by repeatedly applying the noise model derived in Step 1 to every cell in the matrix.
Step 3, calculating a consensus: each simulated replicate (from Step 2) is clustered to create an association matrix. All the association matrices (bottom) are averaged into a single noise consensus matrix (top) that reflects the frequency with which cells are observed in the same cluster across all simulated replicates. Based on this matrix, noise consensus clusters can then be derived (coloured bar above the matrix).
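Steps 2 and 3 can be illustrated with a minimal Python sketch. This is not the BEARscc implementation. The noise model in `simulate_replicate` is a placeholder (Poisson resampling plus a count-dependent drop-out) standing in for the model that BEARscc fits to spike-in measurements, and all function names, the `dropout_scale` parameter and the toy clustering rule are hypothetical. Only the consensus step follows the description directly: each replicate is clustered, turned into an association matrix, and the matrices are averaged.

```python
import numpy as np

def simulate_replicate(counts, rng, dropout_scale=2.0):
    """Placeholder noise model (an assumption, not BEARscc's fitted model)."""
    # Resample every observed count from a Poisson centred on that count.
    noisy = rng.poisson(counts)
    # Drop-out: the lower the observed count, the likelier it is zeroed.
    p_drop = np.exp(-counts / dropout_scale)
    noisy[rng.random(counts.shape) < p_drop] = 0
    return noisy

def association_matrix(labels):
    """Cells x cells matrix: 1 where two cells share a cluster, else 0."""
    labels = np.asarray(labels)
    return (labels[:, None] == labels[None, :]).astype(float)

def noise_consensus(counts, cluster_fn, n_replicates=50, seed=0):
    """Average the association matrices over simulated replicates."""
    rng = np.random.default_rng(seed)
    n_cells = counts.shape[1]
    total = np.zeros((n_cells, n_cells))
    for _ in range(n_replicates):
        replicate = simulate_replicate(counts, rng)
        total += association_matrix(cluster_fn(replicate))
    return total / n_replicates

def toy_cluster(counts):
    # Hypothetical clustering rule: split cells by which of two genes dominates.
    return (counts[0] > counts[1]).astype(int)

# Toy data: rows are genes, columns are cells.
counts = np.array([[100, 120, 2, 1],
                   [3, 1, 90, 110]])
consensus = noise_consensus(counts, toy_cluster)
print(np.round(consensus, 2))
```

Each entry of `consensus` is the fraction of simulated replicates in which two cells were placed in the same cluster. Any clustering function that returns one label per cell could be passed in place of `toy_cluster`.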