Kpath ‐ statistical reference-based compression for short reads

Storing, transmitting, and archiving data produced by next generation sequencing is a significant computational burden. New compression techniques tailored to short-read sequence data are needed.

Researchers from Carnegie Mellon University and Stony Brook University have developed an approach to compression that reduces the difficulty of managing large-scale sequencing data. This novel approach sits between pure reference-based compression and reference-free compression, and combines much of the benefit of reference-based approaches with the flexibility of de novo encoding. The new method, called path encoding, draws a connection between storing paths in de Bruijn graphs and context-dependent arithmetic coding. Supporting this method is a system for compactly storing sets of k-mers, which is of independent interest. The researchers were able to encode RNA-seq reads using 3% – 11% of the space of the sequence in raw FASTA files, which is on average more than 34% smaller than competing approaches. They also show that good compression can still be achieved even if the reference is very poorly matched to the reads being encoded.
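
To make the link between de Bruijn graph paths and context-dependent coding more concrete, here is a minimal Python sketch. It is not the authors' Go implementation. It only estimates how many bits an ideal arithmetic coder would need for a read when each base is predicted from the k-1 bases before it, using k-mer counts taken from a reference. The function names `count_kmers` and `estimated_bits`, the add-one pseudocount, and the restriction to the alphabet ACGT are all assumptions made for illustration.

```python
import math
from collections import Counter


def count_kmers(reference, k):
    """Count every k-mer (length-k substring) in the reference string."""
    return Counter(reference[i:i + k] for i in range(len(reference) - k + 1))


def estimated_bits(read, kmer_counts, k, pseudocount=1.0):
    """Ideal coding cost, in bits, of every base after the first k-1.

    Each base is predicted from the k-1 bases before it, which plays the
    role of the current node on a de Bruijn graph path. The first k-1
    bases of the read are assumed to be stored separately and are not
    counted. The pseudocount (an assumption of this sketch) keeps k-mers
    that are absent from the reference encodable.
    """
    bits = 0.0
    for i in range(k - 1, len(read)):
        context = read[i - (k - 1):i]
        # Weight of each possible next base given the current context.
        weights = {b: kmer_counts.get(context + b, 0) + pseudocount
                   for b in "ACGT"}
        total = sum(weights.values())
        bits += -math.log2(weights[read[i]] / total)
    return bits


if __name__ == "__main__":
    reference = "ACGTACGTACGTACGT"
    counts = count_kmers(reference, k=4)
    print(estimated_bits("ACGTACGT", counts, k=4))  # read that follows the reference
    print(estimated_bits("ATTTGGCA", counts, k=4))  # read with no k-mers in the reference
```

In this sketch, a read whose k-mers appear in the reference costs fewer than 2 bits per base, while a read with no matching k-mers falls back to a uniform 2 bits per base. That mirrors, in a very simplified way, the claim that compression degrades gracefully when the reference is poorly matched.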

Availability and implementation: Source code and binaries freely available for download at http://www.cs.cmu.edu/~ckingsf/software/pathenc/, implemented in Go and supported on Linux and Mac OS X.

Contact: carlk@cs.cmu.edu

Kingsford C, Patro R. (2015) Reference-based compression of short-read sequences using path encoding. Bioinformatics [Epub ahead of print]. [article]
