In ecological studies, microbial diversity is now mostly assessed by detecting phylogenetic marker genes such as the 16S ribosomal RNA gene. However, PCR amplification of these marker genes produces a substantial number of artificial sequences, often referred to as chimeras. Several algorithms have been developed to remove these chimeras, but efforts to combine the different methodologies have been limited. We therefore developed two machine learning classifiers (CATCh reference and CATCh de novo) that integrate the output of existing chimera detection tools into a new, more powerful method. Compared with existing tools in either reference-based or de novo mode, our ensemble method performs better on a wide range of sequencing data, including simulated, 454 pyrosequencing and Illumina MiSeq datasets. Because the algorithm combines the advantages of the individual chimera detection tools, it produces more robust results when challenged with chimeric sequences that have a low parent divergence, a short chimeric range or a varying number of parents. Additionally, we show that integrating CATCh into the preprocessing pipeline improves the quality of the clustering into operational taxonomic units.
Availability – The CATCh software and accompanying documentation are available at: http://science.sckcen.be/en/Institutes/EHS/MCB/MIC/Bioinformatics/CATCh. The implementation has been tested on Mac and Linux (RHEL-derived distributions).
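
The ensemble idea summarized above, in which the outputs of several chimera detection tools become the input features of a single classifier, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the three tool scores, the toy training reads, the function names, and the use of a small logistic regression as a stand-in for the actual CATCh classifier, whose model and features are not specified here.

```python
import math

# Hypothetical per-read scores from three chimera detection tools
# (columns: tool_a, tool_b, tool_c). These values are invented for
# illustration and are not taken from any real dataset.
TRAIN_X = [
    [0.9, 0.8, 0.7],
    [0.8, 0.2, 0.9],
    [0.7, 0.9, 0.3],
    [0.2, 0.1, 0.3],
    [0.1, 0.3, 0.2],
    [0.3, 0.2, 0.1],
]
# Labels: 1 = chimeric read, 0 = non-chimeric read.
TRAIN_Y = [1, 1, 1, 0, 0, 0]


def sigmoid(z):
    """Map a real-valued score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic(xs, ys, lr=0.5, epochs=2000):
    """Fit a logistic regression by batch gradient descent.

    Stand-in for the ensemble classifier: it learns one weight per
    tool, so tools that separate the classes well get more influence.
    """
    n_samples = len(xs)
    n_features = len(xs[0])
    weights = [0.0] * n_features
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(bias + sum(w * v for w, v in zip(weights, x)))
            err = p - y
            for j, v in enumerate(x):
                grad_w[j] += err * v
            grad_b += err
        weights = [w - lr * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / n_samples
    return weights, bias


def predict_chimera(weights, bias, scores, threshold=0.5):
    """Return True if the combined tool scores indicate a chimera."""
    p = sigmoid(bias + sum(w * v for w, v in zip(weights, scores)))
    return p >= threshold


if __name__ == "__main__":
    weights, bias = train_logistic(TRAIN_X, TRAIN_Y)
    # A read on which the tools disagree: the learned weights decide.
    print(predict_chimera(weights, bias, [0.8, 0.2, 0.9]))
```

In this toy setting a read flagged strongly by only two of the three tools can still be classified as chimeric, which is the kind of behaviour a fixed single-tool threshold cannot provide.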