Single-cell sequencing experiments use short DNA barcode ‘tags’ to identify reads that originate from the same cell. In order to recover single-cell information from such experiments, reads must be grouped based on their barcode tag, a crucial processing step that precedes other computations. However, this step can be difficult due to high rates of mismatch and deletion errors that can afflict barcodes.
Here researchers from the California Institute of Technology present an approach to identify and error-correct barcodes by traversing the de Bruijn graph of circularized barcode k-mers. Their approach is based on the observation that circularizing a barcode sequence can yield error-free k-mers even when the size of k is large relative to the length of the barcode sequence, a regime which is typical single-cell barcoding applications. This allows for assignment of reads to consensus fingerprints constructed from k-mers.
The researchers show that for single-cell RNA-Seq circularization improves the recovery of accurate single-cell transcriptome estimates, especially when there are a high number of errors per read. This approach is robust to the type of error (mismatch, insertion, deletion), as well as to the relative abundances of the cells. Sircel, a software package that implements this approach is described and publically available.
A strategy to use k-mer counting to identify sequence barcodes
a Circularizing barcodes ensures robustness against single mismatches. An example sequence ‘BARCODE’ contains an error (highlighted in red). When the barcode sequence is short relative to k, all k-mers from this sequence will contain the mutated base. Circularizing the sequence (bottom) ensures that there will be some error-free k-mers from a sequence independent of the position of the error. b An example circular k-mer graph containing one barcode. Error-containing reads were simulated from a ground-truth barcode. Reads were circularized and k-mers were counted. The resultant k-mer graph is plotted here. Nodes in this graph are represented as gray dots, and edges as blue lines. Edges weights are represented by shading (dark = high edge weight). Despite a fairly high rate of error (Poisson 3 errors per 12 nucleotide barcode), the true barcode path is visually discernable with a modest number of reads. c An example circular k-mer graph containing three barcodes. Same as above
Availability – Project home page: https://github.com/pachterlab/sircel