Empirical estimation of sequencing error rates using smoothing splines

Next-generation sequencing has been used to address a diverse range of biological problems through, for example, polymorphism and mutation discovery and microRNA profiling. Compared with conventional sequencing, however, its error rates are often higher, which affects downstream genomic analysis. Wang et al. (BMC Bioinformatics 13:185, 2012) proposed a shadow regression approach that estimates error rates for next-generation sequencing data by assuming a linear relationship between the number of reads sequenced and the number of reads containing errors (called shadows). This linear read-shadow relationship may not hold for all types of sequence data, so a more reliable estimate should not depend on linearity. Researchers at MD Anderson Cancer Center proposed an empirical error rate estimation approach that uses cubic and robust smoothing splines to model the relationship between the number of reads sequenced and the number of shadows.
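To make the contrast between a linear fit and a spline fit concrete, here is a minimal sketch on synthetic data. Everything in it is an assumption made for illustration: the sublinear read-shadow curve, the noise level, the through-origin least-squares line used as a stand-in for shadow linear regression, and the per-sequence error rate defined as shadows / (reads + shadows). It uses SciPy's `UnivariateSpline` for the cubic smoothing spline and does not attempt the robust spline variant. It is not the authors' implementation.

```python
import numpy as np
from scipy.interpolate import UnivariateSpline

rng = np.random.default_rng(0)

# Hypothetical data: error-free read counts and a shadow count that grows
# sublinearly with them (an assumed shape, chosen only to break linearity).
reads = np.linspace(10, 1000, 50)
true_shadows = 3.0 * np.sqrt(reads)
noise_sd = 2.0
shadows = true_shadows + rng.normal(0.0, noise_sd, size=reads.size)

# Stand-in for the linear model: least-squares line through the origin.
slope = np.sum(reads * shadows) / np.sum(reads ** 2)
linear_fit = slope * reads

# Cubic smoothing spline (k=3). s bounds the residual sum of squares; it is
# set to n * noise variance, which is a rule of thumb, not a tuned value.
spline = UnivariateSpline(reads, shadows, k=3, s=reads.size * noise_sd ** 2)
spline_fit = spline(reads)


def rmse(fit):
    """Root-mean-square distance between a fit and the noise-free curve."""
    return float(np.sqrt(np.mean((fit - true_shadows) ** 2)))


def implied_error_rate(n_reads, n_shadows):
    """Assumed definition: share of a sequence's reads that are shadows."""
    return n_shadows / (n_reads + n_shadows)


linear_rmse, spline_rmse = rmse(linear_fit), rmse(spline_fit)
print(f"RMSE linear: {linear_rmse:.2f}, spline: {spline_rmse:.2f}")
for n in (50.0, 500.0):
    lin = implied_error_rate(n, slope * n)
    spl = implied_error_rate(n, float(spline(n)))
    print(f"reads={n:.0f}: linear rate {lin:.3f}, spline rate {spl:.3f}")
```

Under the linear model the implied error rate is the same at every read count, whereas the spline lets it vary with abundance, which is the flexibility the spline-based approach relies on.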

The researchers ran simulation studies using a frequency-based approach that generates the read and shadow counts directly and mimics the structure of real sequence count data. They used the simulations to evaluate the proposed approach against shadow linear regression; the proposed approach gave more accurate error rate estimates in every scenario tested. They also applied it to sequence data from the MicroArray Quality Control (MAQC) project, a mutation screening study, the Encyclopedia of DNA Elements (ENCODE) project, and bacteriophage PhiX DNA samples.
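The idea of generating read and shadow counts directly, rather than simulating individual reads base by base, can be sketched as follows. This is one hypothetical way to do it and not the procedure used in the paper: the log-normal abundance model, the function name `simulate_counts`, and the constant per-read error probability are all assumptions for illustration.

```python
import random


def simulate_counts(n_sequences, error_rate, seed=0):
    """Return (error_free, shadows) count pairs for hypothetical sequences."""
    rng = random.Random(seed)
    pairs = []
    for _ in range(n_sequences):
        # Assumed abundance model: log-normal totals give a few highly
        # abundant sequences and many rare ones.
        total = max(1, round(rng.lognormvariate(4.0, 1.0)))
        # Each read independently contains an error with probability
        # error_rate, so the shadow count is a binomial draw.
        shadows = sum(rng.random() < error_rate for _ in range(total))
        pairs.append((total - shadows, shadows))
    return pairs


pairs = simulate_counts(n_sequences=300, error_rate=0.2)
total_reads = sum(r + s for r, s in pairs)
total_shadows = sum(s for _, s in pairs)
observed_rate = total_shadows / total_reads
print(f"{len(pairs)} sequences, {total_reads} reads, "
      f"shadow fraction {observed_rate:.3f}")
```

Because the true error rate is fixed by construction, any estimator can be scored against it. Making `error_rate` depend on `total` would produce the kind of nonlinear read-shadow relationship that a linear model cannot capture.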

Median error rates in MAQC data using shadow linear regression and smoothing spline approaches

Sample      Expected ER   SRER     SRER bias   EER_CS   EER_CS bias   EER_RS   EER_RS bias
SRR037452   0.3305        0.2578   0.0727      0.3104   0.0201        0.3096   0.0209
SRR037453   0.1917        0.1584   0.0333      0.1824   0.0093        0.1818   0.0099
SRR037454   0.2354        0.1515   0.0839      0.2060   0.0294        0.2059   0.0295
SRR037455   0.1759        0.1448   0.0311      0.1675   0.0084        0.1668   0.0091
SRR037456   0.2312        0.1622   0.0690      0.2037   0.0275        0.2035   0.0277
SRR037457   0.1841        0.1480   0.0361      0.1777   0.0064        0.1771   0.0070
SRR037458   0.2653        0.2321   0.0332      0.2582   0.0071        0.2575   0.0078
SRR037459   0.2371        0.1943   0.0428      0.2202   0.0169        0.2203   0.0168
SRR037460   0.2530        0.2018   0.0512      0.2503   0.0027        0.2490   0.0040
SRR037461   0.2180        0.1704   0.0476      0.2105   0.0075        0.2104   0.0076
SRR037462   0.2443        0.1734   0.0709      0.2322   0.0121        0.2308   0.0135
SRR037463   0.2154        0.1654   0.0500      0.2023   0.0131        0.2045   0.0109
SRR037464   0.2624        0.1666   0.0958      0.2392   0.0232        0.2403   0.0221
SRR037465   0.2145        0.1742   0.0403      0.2038   0.0107        0.2037   0.0108

ER: error rate. SRER: shadow regression estimate. EER_CS and EER_RS: empirical estimates from the cubic and robust smoothing splines. Bias: expected ER minus the estimate.
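The comparison in the table can be summarized with a few lines of arithmetic. The sketch below copies the expected and estimated error rates from the table, recomputes each bias as expected minus estimate, and averages it per method. The variable names are illustrative, and the mean is just one convenient summary, not a statistic reported in the source.

```python
# (sample, expected ER, SRER, EER_CS, EER_RS), copied from the table above.
rows = [
    ("SRR037452", 0.3305, 0.2578, 0.3104, 0.3096),
    ("SRR037453", 0.1917, 0.1584, 0.1824, 0.1818),
    ("SRR037454", 0.2354, 0.1515, 0.2060, 0.2059),
    ("SRR037455", 0.1759, 0.1448, 0.1675, 0.1668),
    ("SRR037456", 0.2312, 0.1622, 0.2037, 0.2035),
    ("SRR037457", 0.1841, 0.1480, 0.1777, 0.1771),
    ("SRR037458", 0.2653, 0.2321, 0.2582, 0.2575),
    ("SRR037459", 0.2371, 0.1943, 0.2202, 0.2203),
    ("SRR037460", 0.2530, 0.2018, 0.2503, 0.2490),
    ("SRR037461", 0.2180, 0.1704, 0.2105, 0.2104),
    ("SRR037462", 0.2443, 0.1734, 0.2322, 0.2308),
    ("SRR037463", 0.2154, 0.1654, 0.2023, 0.2045),
    ("SRR037464", 0.2624, 0.1666, 0.2392, 0.2403),
    ("SRR037465", 0.2145, 0.1742, 0.2038, 0.2037),
]
methods = ("SRER", "EER_CS", "EER_RS")


def bias(expected, estimate):
    """Bias as used in the table: expected error rate minus the estimate."""
    return expected - estimate


# Average bias per method across the 14 samples.
mean_bias = {
    name: sum(bias(row[1], row[2 + i]) for row in rows) / len(rows)
    for i, name in enumerate(methods)
}
for name, value in mean_bias.items():
    print(f"{name}: mean bias {value:.4f}")

# True when both spline estimates beat shadow regression on every sample.
splines_always_closer = all(
    max(bias(r[1], r[3]), bias(r[1], r[4])) < bias(r[1], r[2]) for r in rows
)
print("splines closer on every sample:", splines_always_closer)
```

On these 14 samples the mean bias is roughly 0.054 for shadow regression and about 0.014 for either spline, and both spline estimates are closer to the expected rate on every sample.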

 

The proposed empirical error rate estimation approach does not assume a linear relationship between the error-free read and shadow counts, and it provides more accurate error rate estimates for next-generation, short-read sequencing data.

Zhu X, Wang J, Peng B, Shete S. (2016) Empirical estimation of sequencing error rates using smoothing splines. BMC Bioinformatics 17:177.
