Identifying variants using high-throughput sequencing data is currently a challenge because true biological variants can be indistinguishable from technical artifacts. One source of technical artifact results from incorrectly aligning experimentally observed sequences to their true genomic origin (‘mismapping’) and inferring differences in mismapped sequences to be true variants.
A team led by researchers at the University of North Carolina developed BlackOPs, an open-source tool that simulates experimental RNA-seq and DNA whole exome sequences derived from the reference genome, aligns these sequences by custom parameters, detects variants and outputs a blacklist of positions and alleles caused by mismapping. Blacklists contain thousands of artifact variants that are indistinguishable from true variants and, for a given sample, are expected to be almost completely false positives. They show that these blacklist positions are specific to the alignment algorithm and read length used, and BlackOPs allows users to generate a blacklist specific to their experimental setup. They queried the dbSNP and COSMIC variant databases and found numerous variants indistinguishable from mapping errors. They demonstrate how filtering against blacklist positions reduces the number of potential false variants using an RNA-seq glioblastoma cell line data set. In summary, accounting for mapping-caused variants tuned to experimental setups reduces false positives and, therefore, improves genome characterization by high-throughput sequencing.
Availability – BlackOPs is open-source software, written in Perl, distributed under GPL-3.0, and available at http://sourceforge.net/projects/rnaseqvariantbl/. All blacklists produced for this manuscript are also available at this site.
- Cabanski CR, Wilkerson MD, Soloway M, Parker JS, Liu J, Prins JF, Marron JS, Perou CM, Hayes DN. (2013) BlackOPs: increasing confidence in variant detection through mappability filtering. Nucleic Acids Res [Epub ahead of print]. [article]