Researchers from Johns Hopkins University have developed Boiler, a new software tool for compressing and querying large collections of RNA-seq alignments. Boiler discards most per-read data, keeping only a genomic coverage vector plus a few empirical distributions summarizing the alignments. Because so little per-read data is kept, its storage footprint is often much smaller than that achieved by other compression tools. Even so, the most relevant per-read information can be recovered: the authors show that Boiler compression has only a slight negative impact on the results of downstream tools for isoform assembly and quantification. Boiler also lets the user run fast, useful queries without decompressing the entire file.
Illustration of how Boiler compresses alignments in a bundle, for a dataset with unpaired reads
(a) The genome is divided into “partitions” (colored segments) based on the processed splice sites. A bucket is defined by the subset of partitions spanned (as well as the values of the NH:i and XS:A fields, though these are omitted from the figure for simplicity). Each bucket stores (b) the coverage vector and (c) the length tally of the reads assigned to the bucket.
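The bucketing scheme in the caption can be illustrated with a minimal Python sketch. This is not Boiler's actual code or file format: the function name `build_buckets`, the half-open coordinate convention, and the representation of a read as a list of aligned blocks are all assumptions made for illustration. As in the figure, the NH:i and XS:A fields are left out of the bucket key.

```python
from bisect import bisect_right
from collections import Counter, defaultdict


def build_buckets(splice_sites, bundle_start, bundle_end, reads):
    """Group unpaired reads into buckets keyed by the partitions they span.

    Hypothetical helper. Each read is a list of (start, end) aligned blocks
    in half-open coordinates; a spliced read has more than one block.
    """
    # Partition boundaries: the bundle ends plus every splice site.
    bounds = sorted({bundle_start, bundle_end, *splice_sites})
    buckets = defaultdict(lambda: {"coverage": Counter(), "lengths": Counter()})

    for blocks in reads:
        spanned = set()
        read_length = 0
        for start, end in blocks:
            read_length += end - start
            # Index of the partition containing the first and last base.
            first = bisect_right(bounds, start) - 1
            last = bisect_right(bounds, end - 1) - 1
            spanned.update(range(first, last + 1))

        # The bucket key is the subset of partitions the read spans.
        bucket = buckets[tuple(sorted(spanned))]

        # (b) per-base coverage contributed by this bucket's reads.
        for start, end in blocks:
            for pos in range(start, end):
                bucket["coverage"][pos] += 1
        # (c) tally of read lengths in this bucket.
        bucket["lengths"][read_length] += 1

    return bounds, dict(buckets)


if __name__ == "__main__":
    reads = [
        [(5, 15)],             # unspliced, inside the first partition
        [(10, 20), (40, 45)],  # spliced across an intron from 20 to 40
        [(8, 18)],             # unspliced, inside the first partition
    ]
    bounds, buckets = build_buckets([20, 40], 0, 100, reads)
    for key, bucket in sorted(buckets.items()):
        print(key, dict(bucket["lengths"]))
```

Once the buckets are built, the individual reads are no longer needed: each bucket holds only its coverage and its length tally, which is the kind of summary the caption describes.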
Availability – Boiler is free open source software available from https://github.com/jpritt/boiler.