CWig: Compressed representation of Wiggle/BedGraph format

BigWig, a format for representing read density data, is one of the most popular data types in genomics. It can represent peak intensity in ChIP-seq, transcript expression in RNA-seq, copy number variation in whole genome sequencing, and so on. The UCSC ENCODE project uses the bigWig format heavily for storage and visualization: of the 5.2TB ENCODE hg19 database, 1.6TB (31% of the total space) is used to store bigWig files. The bigWig format not only saves a lot of space but also supports the fast queries that are crucial for interactive analysis and browsing. A bigWig file is often similar in size to the gzipped raw data, yet it can still support about 5 thousand random queries per second.
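To make "density data" and "random query" concrete, here is a minimal Python sketch that parses a few bedGraph-style records (chromosome, start, end, value, with 0-based half-open intervals) and looks up the value at a single position by binary search. The sample data and the helper names `parse_bedgraph` and `query` are hypothetical, chosen only for illustration; this is not the index structure used by bigWig or by CWig, just one simple way to answer such a query on sorted intervals.

```python
from bisect import bisect_right

# Hypothetical toy data in bedGraph layout: chrom, start, end, value.
# Intervals are 0-based, half-open, and sorted by start position.
BEDGRAPH_TEXT = """\
chr1 0 100 0.0
chr1 100 250 3.5
chr1 250 400 1.0
"""


def parse_bedgraph(text):
    """Parse bedGraph lines into per-chromosome lists (starts, ends, values)."""
    tracks = {}
    for line in text.splitlines():
        # Skip blank lines and header/comment lines.
        if not line.strip() or line.startswith(("track", "browser", "#")):
            continue
        chrom, start, end, value = line.split()
        starts, ends, values = tracks.setdefault(chrom, ([], [], []))
        starts.append(int(start))
        ends.append(int(end))
        values.append(float(value))
    return tracks


def query(tracks, chrom, pos):
    """Return the value of the interval covering pos, or None if uncovered."""
    if chrom not in tracks:
        return None
    starts, ends, values = tracks[chrom]
    # Index of the last interval whose start is <= pos.
    i = bisect_right(starts, pos) - 1
    if i >= 0 and pos < ends[i]:
        return values[i]
    return None


tracks = parse_bedgraph(BEDGRAPH_TEXT)
print(query(tracks, "chr1", 120))
```

Each lookup here costs one binary search over an uncompressed in-memory list. Formats such as bigWig and CWig have to answer the same kind of question while keeping the data compressed on disk.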

Although bigWig is good enough at the moment, both storage space and query time are expected to become limiting factors as sequencing gets cheaper. This paper describes a new method for storing density data, named CWig. On average, the format uses one third of the size of existing bigWig files and improves random query speed by up to 100 times.




Hoang DH, Sung WK. (2014) CWig: Compressed representation of Wiggle/BedGraph format. Bioinformatics [Epub ahead of print].
