zUMIs – A fast and flexible pipeline to process RNA sequencing data with UMIs

Single cell RNA-seq (scRNA-seq) experiments typically analyze hundreds or thousands of cells after amplification of the cDNA. The high throughput is made possible by the early introduction of sample-specific barcodes (BCs) and the amplification bias is alleviated by unique molecular identifiers (UMIs). Thus the ideal analysis pipeline for scRNA-seq data needs to efficiently tabulate reads according to both BC and UMI.
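The tabulation described above can be illustrated with a minimal sketch. It is not the zUMIs implementation: the function name `count_reads_and_umis` and the `(barcode, umi, gene)` record layout are assumptions made for illustration, and UMIs are collapsed by exact sequence identity only.

```python
from collections import defaultdict


def count_reads_and_umis(records):
    """Tabulate reads and distinct UMIs per (barcode, gene).

    records: iterable of (barcode, umi, gene) tuples, one per mapped read.
    This layout is hypothetical and chosen only for the example.
    """
    read_counts = defaultdict(int)
    umi_sets = defaultdict(set)
    for barcode, umi, gene in records:
        key = (barcode, gene)
        read_counts[key] += 1      # every read counts once
        umi_sets[key].add(umi)     # identical UMIs collapse into one molecule
    umi_counts = {key: len(umis) for key, umis in umi_sets.items()}
    return dict(read_counts), umi_counts


records = [
    ("AAAC", "GGT", "GeneA"),
    ("AAAC", "GGT", "GeneA"),  # same BC, UMI and gene: treated as a PCR duplicate
    ("AAAC", "CTA", "GeneA"),
    ("TTTG", "GGT", "GeneA"),  # same UMI but a different BC: a separate molecule
]
reads, umis = count_reads_and_umis(records)
```

In this toy input the first barcode has three reads for GeneA but only two distinct UMIs, which is the sense in which UMI counts are less affected by amplification bias than read counts.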

zUMIs is a pipeline that can handle both known and random BCs and also efficiently collapses UMIs, either for exon-mapping reads only or for both exon- and intron-mapping reads. If BC annotation is missing, zUMIs can accurately detect intact cells from the distribution of sequencing reads. Another unique feature of zUMIs is its adaptive downsampling function, which facilitates dealing with hugely varying library sizes and also makes it possible to evaluate whether a library has been sequenced to saturation. To illustrate the utility of zUMIs, researchers from Ludwig-Maximilians University analysed a single-nucleus RNA-seq dataset and showed that more than 35% of all reads map to introns. They furthermore showed that these intronic reads are informative about expression levels, significantly increasing the number of detected genes and improving the cluster resolution.
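The idea of downsampling libraries to a comparable size can be sketched as follows. This is a simplified illustration, not the adaptive procedure used by zUMIs: the function `downsample_barcodes`, its parameters, and the choice to discard barcodes below the lower bound are all assumptions made for the example.

```python
import random


def downsample_barcodes(reads_per_barcode, min_reads, max_reads, seed=0):
    """Bring per-barcode libraries into the range [min_reads, max_reads].

    reads_per_barcode: dict mapping barcode -> list of read identifiers.
    Barcodes with fewer than min_reads reads are dropped (an assumption of
    this sketch); barcodes with more than max_reads reads are randomly
    subsampled without replacement to exactly max_reads reads.
    """
    rng = random.Random(seed)  # fixed seed keeps the example reproducible
    result = {}
    for barcode, reads in reads_per_barcode.items():
        if len(reads) < min_reads:
            continue
        if len(reads) > max_reads:
            result[barcode] = rng.sample(reads, max_reads)
        else:
            result[barcode] = list(reads)
    return result


libraries = {
    "deep": list(range(100)),    # far more reads than the upper bound
    "ok": list(range(10)),       # already within range
    "shallow": list(range(2)),   # below the lower bound
}
downsampled = downsample_barcodes(libraries, min_reads=5, max_reads=20)
```

Repeating such a downsampling at several depths and counting detected genes or UMIs at each depth is one common way to judge whether a library approaches saturation.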

Schematic of the zUMIs pipeline


Each of the grey panels from left to right depicts a step of the zUMIs pipeline. First, fastq files are filtered according to user-defined barcode (BC) and unique molecular identifier (UMI) quality thresholds. Next, the remaining cDNA reads are mapped to the reference genome using STAR. Gene-wise read and UMI count tables are generated for exon, intron and exon+intron overlapping reads. To obtain comparable library sizes, reads can be downsampled to a desired range during the counting step. In addition, zUMIs also generates data and plots for several quality measures, such as the number of detected genes/UMIs per barcode and distribution of reads into mapping feature categories.
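The first step, filtering on BC and UMI base quality, might look like the sketch below. It assumes Phred+33 encoded quality strings, and the rule used here (keep a read if at most `max_low_bases` bases fall below `min_phred`) is an illustrative assumption rather than the documented zUMIs behaviour; the function names and default values are hypothetical.

```python
def count_low_quality_bases(quality_string, min_phred):
    """Count bases whose Phred score is below min_phred (Phred+33 encoding assumed)."""
    return sum(1 for char in quality_string if ord(char) - 33 < min_phred)


def passes_quality(bc_quality, umi_quality, min_phred=20, max_low_bases=1):
    """Keep a read only if both the BC and the UMI have few low-quality bases.

    min_phred and max_low_bases are hypothetical, user-defined thresholds.
    """
    bc_ok = count_low_quality_bases(bc_quality, min_phred) <= max_low_bases
    umi_ok = count_low_quality_bases(umi_quality, min_phred) <= max_low_bases
    return bc_ok and umi_ok


# 'I' encodes Phred 40 and '#' encodes Phred 2 in Phred+33.
kept = passes_quality("IIIIII", "II#I")     # one low-quality UMI base is tolerated
dropped = passes_quality("I##III", "IIII")  # two low-quality BC bases exceed the limit
```

Reads that fail such a filter would be removed before mapping, so only reads with trustworthy BC and UMI sequences contribute to the count tables.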

The flexibility of zUMIs allows it to accommodate data generated with any of the major scRNA-seq protocols that use BCs and UMIs, and the authors present it as the most feature-rich, fast and user-friendly pipeline to process such scRNA-seq data.

Availability: https://github.com/sdparekh/zUMIs

Parekh S, Ziegenhain C, Vieth B, Enard W, Hellmann I. (2018) zUMIs – A fast and flexible pipeline to process RNA sequencing data with UMIs. Gigascience [Epub ahead of print].

