Single-cell sequencing methods rely on molecule-counting strategies to account for amplification biases, yet no experimental strategy exists to evaluate counting performance. Researchers from the Karolinska Institute introduce molecular spikes, RNA spike-ins containing built-in unique molecular identifiers (UMIs), and use them to identify critical experimental and computational conditions for accurate RNA counting in single-cell RNA-sequencing (scRNA-seq). Using molecular spikes, the researchers uncover impaired RNA counting in some methods, whose inflated UMI counts made them uninformative for cellular RNA abundances. They further leverage molecular spikes to improve estimates of total endogenous RNA amounts in cells, and introduce a strategy to correct experiments with impaired RNA counting. The molecular spikes and the accompanying R package UMIcountR will improve the validation of new methods, better estimate and adjust for cellular mRNA amounts and enable more in-depth characterization of RNA counting in scRNA-seq.
Direct assessment of single-cell RNA counting using molecular spikes
a, Schematic of the cloning strategy for molecular spikes, where an oligonucleotide library is inserted into a molecular spike entry vector, and the vector pool is linearized and in vitro transcribed to generate a pool of molecular RNA spike-ins. b, Coordinates of molecular spikes in base pairs (bp), with the inbuilt UMI at the 5′ or 3′ end. c, 5′ molecular spike complexity estimated by fitting a nonlinear asymptotic model (dotted line) to unique spUMI sequences observed as a function of the number of spUMIs sequenced across cells (blue line). d, Scatter plot showing error-corrected (Hamming distance (HD) 1) Smart-seq3 RNA counts (y axis) against the number of spiked molecules (x axis), ranging from 1 to 100 spiked molecules per cell. Data from HEK293FT cells (n = 48 cells). e, Scatter plot showing the number of spiked molecules (x axis) against error-corrected RNA counts (Hamming distance 1) for data generated with variations of the Smart-seq3 protocol that either use cDNA cleanup before amplification (0.1 µM FWD primer) or omit cleanup, leaving residual TSO, with different concentrations of FWD primer. Data from 39 cells or more are shown per condition. f, Scatter plot showing the number of spiked molecules (x axis) against error-corrected RNA counts (Hamming distance 1) for 10x Genomics (v.2) data (n = 955 cells). g, Scatter plot showing the number of spiked molecules (x axis) against error-corrected RNA counts (Hamming distance 1) for data generated with variations of the SCRB-seq and tSCRB-seq protocols: standard SCRB-seq (green, 53 cells), excluding exonuclease I treatment (red, 77 cells) and direct PCR (tSCRB-seq) (blue, 90 cells). h, Percent counting error (observed/true) for RNA counts generated with variations of the SCRB-seq and tSCRB-seq protocols. The solid line denotes the mean over cells per condition and the shaded area the standard deviation, colored by experimental condition: direct PCR (tSCRB-seq) (90 cells), no exonuclease I (77 cells) and standard protocol (53 cells). The dotted line represents the expected overcounting if every sequenced read corresponds to a new UMI observation.
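To make the "error-corrected (Hamming distance 1)" counts and the observed/true ratio in the caption more concrete, here is a minimal Python sketch. It is not the UMIcountR implementation: the greedy collapsing rule, the function names (`hamming`, `collapse_umis`, `observed_over_true_percent`) and the toy reads are all assumptions made for illustration.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


def collapse_umis(reads: list[str], max_dist: int = 1) -> dict[str, int]:
    """Greedy collapse (an assumed, simplified rule): visit UMIs from most to
    least abundant and fold each one into the first already-kept UMI that lies
    within max_dist mismatches; otherwise keep it as a new molecule."""
    counts = Counter(reads)
    kept: dict[str, int] = {}
    # Descending read count; ties broken alphabetically so results are stable.
    for umi, n in sorted(counts.items(), key=lambda kv: (-kv[1], kv[0])):
        parent = next((k for k in kept if hamming(k, umi) <= max_dist), None)
        if parent is None:
            kept[umi] = n
        else:
            kept[parent] += n
    return kept


def observed_over_true_percent(observed: int, true: int) -> float:
    """Observed molecule count as a percentage of the known spiked-in count."""
    return 100.0 * observed / true


# Toy data (hypothetical): three true molecules plus two sequencing-error UMIs.
reads = ["AAAA"] * 5 + ["AAAT"] + ["CCCC"] * 3 + ["CCGC"] + ["GGGG"] * 2
collapsed = collapse_umis(reads)

raw_count = len(set(reads))       # unique UMIs before correction
corrected_count = len(collapsed)  # unique UMIs after HD-1 collapsing
print(raw_count, corrected_count)
print(observed_over_true_percent(corrected_count, true=3))
```

In this toy case, the uncorrected data would overcount the three spiked molecules as five, while collapsing at Hamming distance 1 recovers three. Because molecular spikes carry a known UMI, the true count is available as the denominator, which is what allows counting performance to be evaluated directly.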
Availability – UMIcountR is available at: https://github.com/cziegenhain/UMIcountR