Single-cell RNA-seq protocols provide a powerful means of examining the gamut of cell types and transcriptional states that make up complex biological tissues. Recently developed approaches based on droplet microfluidics, such as inDrop or Drop-seq, use massively multiplexed barcoding to measure the transcriptomes of thousands of individual cells simultaneously. The increasing complexity of such data also creates challenges for subsequent computational processing and troubleshooting of these experiments, with few software options currently available.
Researchers from Harvard Medical School have developed a flexible pipeline for processing droplet-based transcriptome data that implements barcode corrections, classifies cell quality, and reports diagnostic information about the droplet libraries. They introduce advanced methods for correcting composition bias and sequencing errors affecting cellular and molecular barcodes, providing more accurate estimates of molecular counts in individual cells.
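To make concrete why sequencing errors in molecular barcodes (UMIs) inflate molecular counts, here is a minimal sketch of a simple count-based correction. It is not the pipeline's own method: the function name `count_molecules`, the one-mismatch rule, and the toy read counts are all assumptions chosen for illustration.

```python
def count_molecules(umi_reads: dict[str, int]) -> int:
    """Count distinct molecules for one gene in one cell.

    Illustrative heuristic (an assumption, not the published method):
    a UMI is treated as a sequencing error if it differs by exactly one
    base from an already accepted UMI that has more supporting reads.
    """
    # Visit UMIs from most to least supported so likely "parents" come first.
    ordered = sorted(umi_reads, key=lambda u: (-umi_reads[u], u))
    kept: list[str] = []
    for umi in ordered:
        is_error = any(
            sum(a != b for a, b in zip(umi, parent)) == 1
            and umi_reads[parent] > umi_reads[umi]
            for parent in kept
        )
        if not is_error:
            kept.append(umi)
    return len(kept)


# Hypothetical reads per UMI for a single gene in a single cell.
reads = {"ACGT": 10, "ACGA": 1, "TTTT": 4}
naive_count = len(reads)
corrected_count = count_molecules(reads)
```

In this toy input the naive count is 3, while the corrected count is 2, because `ACGA` is one mismatch away from the much better supported `ACGT`.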
Correcting for cellular barcode errors
(A) The number of molecules per cell before and after the merge correction procedure is shown for the mouse ES dataset as a function of overall cell size, as reflected by the rank of the cell (cells with the most molecules have the lowest rank; cells with ≤ 500 molecules were omitted). The results of the merge procedure with and without prior knowledge of the possible barcode list match almost exactly. The resulting distribution shows a pronounced inflection point coinciding with the real number of encapsulated cells.
(B) Analysis of human-mouse cell mixtures suggests a contribution of extracellular background. The fraction of human-mouse molecule admixture is shown (color gradient) for cells binned by the number of molecules originating from the two genomes. The cells were binned by the number of human molecules (x axis), and the fraction of mouse molecules originating from mouse cells exceeding a certain size threshold (y axis) was calculated for each bin. The majority of admixed molecules originate from small mouse cells, with the highest admixture fraction observed between small mouse and small human cells (>75%).
(C) The number of equidistant adjacent larger cellular barcodes (CBs) is shown for each of the observed CBs in the mouse ES dataset. The main panel shows adjacent CBs selected from an a priori known set of valid CB sequences. The inset shows counts of adjacent CBs selected from all CB sequences observed in the dataset.
(D) Comparing all pairs of cells with similar CBs (edit distance 1-5) and distant CBs (distance 7-10), the plot shows the relationship between the theoretical probability of a given number of overlapping UMI-gene combinations between the two cells (x axis) and the empirical probability (observed fraction, y axis) of such cells. Cell pairs with similar CBs show much higher overlap, which is driven by CB sequencing errors. A small fraction of distant CBs (0.1%) also shows higher molecular overlap than expected, which is likely explained by cross-droplet contamination.
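The idea of merging a small cellular barcode into an adjacent larger one, which underlies panels A and C, can be sketched as follows. This is a simplified illustration rather than the pipeline's actual procedure: the function names, the Hamming-distance threshold of 1, and the tie-breaking rule are assumptions. The sketch also ignores the UMI-gene overlap evidence described in panel D.

```python
from collections import Counter


def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length barcodes."""
    return sum(x != y for x, y in zip(a, b))


def merge_barcodes(molecules_per_cb: dict[str, int], max_dist: int = 1) -> dict[str, int]:
    """Fold each small CB into an adjacent, strictly larger CB.

    Assumed rule for illustration: among kept CBs within `max_dist`
    mismatches, pick the closest one; ties (the "equidistant adjacent
    larger CBs" of panel C) go to the CB with the most molecules.
    """
    # Process from largest to smallest so merge targets are settled first.
    ordered = sorted(molecules_per_cb, key=lambda cb: (-molecules_per_cb[cb], cb))
    merged: Counter = Counter()
    for cb in ordered:
        size = molecules_per_cb[cb]
        candidates = [
            kept for kept in merged
            if molecules_per_cb[kept] > size and hamming(kept, cb) <= max_dist
        ]
        if candidates:
            target = min(
                candidates,
                key=lambda k: (hamming(k, cb), -molecules_per_cb[k], k),
            )
            merged[target] += size  # treat cb as an error copy of target
        else:
            merged[cb] = size  # no larger neighbor: keep as its own cell
    return dict(merged)


# Hypothetical molecule counts per observed barcode.
counts = {"AAAA": 1000, "AAAT": 12, "CCCC": 800, "CCCG": 5, "GGGG": 3}
merged_counts = merge_barcodes(counts)
```

With these toy counts, `AAAT` is folded into `AAAA` and `CCCG` into `CCCC`, while `GGGG` has no larger neighbor within one mismatch and is kept as a separate barcode.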