Counting molecules using next-generation sequencing (NGS) suffers from PCR amplification bias, which reduces the accuracy of many quantitative NGS-based experimental methods such as RNA-Seq. This is true even if molecules are made distinguishable using unique molecular identifiers (UMIs) before PCR amplification, and distinct UMIs are counted instead of reads: molecules that are lost entirely during the sequencing process will still cause underestimation of the molecule count, and amplification artifacts like PCR chimeras create phantom UMIs and thus cause overestimation.
University of Vienna researchers introduce the TRUmiCount algorithm to correct for both types of errors. The TRUmiCount algorithm is based on a mechanistic model of PCR amplification and sequencing, whose two parameters have an immediate physical interpretation as PCR efficiency and sequencing depth and can be estimated from experimental data without requiring calibration experiments or spike-ins. We show that our model captures the main stochastic properties of amplification and sequencing, and that it allows us to filter out phantom UMIs and to estimate the number of molecules lost during the sequencing process. Finally, we demonstrate that the phantom-filtered and loss-corrected molecule counts computed by TRUmiCount measure the true number of molecules with considerably higher accuracy than the raw number of distinct UMIs, even if most UMIs are sequenced only once as is typical for single-cell RNA-Seq.
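To make the two corrections concrete, the following minimal sketch shows the arithmetic of a phantom filter followed by a loss correction. It is an illustration under stated assumptions, not the TRUmiCount implementation: the function name `corrected_molecule_count`, the read-count `threshold` used as the phantom filter, and the `loss` argument (the model-derived probability that a true molecule ends up with fewer than `threshold` reads) are all hypothetical names introduced here.

```python
def corrected_molecule_count(reads_per_umi, threshold, loss):
    """Phantom-filter UMIs by read count, then correct for lost molecules.

    Assumptions (hypothetical, for illustration only):
      - UMIs with fewer than `threshold` reads are treated as phantoms.
      - `loss` is the probability, taken from a fitted model, that a true
        molecule receives fewer than `threshold` reads and is thus lost.
    """
    if not 0.0 <= loss < 1.0:
        raise ValueError("loss must lie in [0, 1)")
    # Step 1: keep only UMIs that pass the read-count threshold.
    kept = sum(1 for reads in reads_per_umi if reads >= threshold)
    # Step 2: scale up to account for true molecules that were filtered or unseen.
    return kept / (1.0 - loss)


# Toy input: eight observed UMIs, two of which have a single read.
example_reads = [1, 5, 7, 1, 4, 6, 3, 8]
estimate = corrected_molecule_count(example_reads, threshold=2, loss=0.25)
```

In this toy input, six UMIs pass the threshold, and an assumed loss of 25% scales the estimate to eight molecules. The loss value is made up for the example; in practice it would be derived from the estimated PCR efficiency and sequencing depth.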
A. The relevant steps of library preparation when the UMI method is used. The sample initially contains 3 copies of one molecule and 2 copies of another, which are made unique by labelling with UMIs (•, •, •, •, •). Each of these molecules is expanded into a molecular family during amplification, and a random selection of molecules from these families is sequenced. Counting unique UMIs then counts unique molecules, unless UMIs have a read count of zero (•) or phantom UMIs are produced (• •). B. PCR as a Galton-Watson branching process. Molecule • failed to be copied during the 1st PCR cycle, and its final family size is thus reduced compared to •. C. Normalized family size distribution for efficiencies of 10%, 50% and 90%. The arrows mark the most likely normalized family sizes for the two molecules from (B), assuming a reaction efficiency of 90% and taking their distinct fates during the 1st PCR cycle into account. D. Distribution of reads per UMI for efficiencies of 10%, 50% and 90%, assuming D = 4 reads per UMI on average.
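The process described in panels B to D can be explored with a small simulation. The sketch below is a simplified illustration, not the authors' code: it assumes that each molecule is copied independently with probability `efficiency` in every cycle (a Galton-Watson process), and that the read count of a UMI is Poisson-distributed with mean `depth` times the normalized family size. The function names, the number of cycles, and the Poisson sampling step are assumptions made for this example.

```python
import math
import random


def pcr_family_size(cycles, efficiency, rng):
    """Galton-Watson PCR: every molecule is copied with probability
    `efficiency` in each cycle, starting from a single labelled molecule."""
    size = 1
    for _ in range(cycles):
        copies = sum(1 for _ in range(size) if rng.random() < efficiency)
        size += copies
    return size


def poisson(mean, rng):
    """Draw a Poisson variate with Knuth's multiplication method
    (adequate for the small means used here)."""
    limit = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= rng.random()
        if p <= limit:
            return k
        k += 1


def simulate_reads_per_umi(n_umis, cycles, efficiency, depth, seed=0):
    """Simulate read counts per UMI under the assumptions stated above."""
    rng = random.Random(seed)
    expected_size = (1.0 + efficiency) ** cycles  # mean family size
    reads = []
    for _ in range(n_umis):
        normalized = pcr_family_size(cycles, efficiency, rng) / expected_size
        reads.append(poisson(depth * normalized, rng))
    return reads


# Compare the fraction of UMIs with zero reads at low and high efficiency.
for eff in (0.1, 0.5, 0.9):
    counts = simulate_reads_per_umi(2000, cycles=8, efficiency=eff, depth=4.0)
    unseen = sum(1 for c in counts if c == 0) / len(counts)
    print(f"efficiency={eff:.1f}  fraction with zero reads={unseen:.3f}")
```

Under these assumptions, lower efficiencies give a broader family size distribution, so more UMIs end up with zero reads at the same average depth.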