Enormous databases of short-read RNA-seq experiments, such as the NIH Sequence Read Archive, are now available. These databases could answer many questions about condition-specific expression or population variation, and they will only grow over time. However, these collections remain difficult to use because there is no practical way to search them for a particular expressed sequence. Although some progress has been made on this problem, it is still not feasible to search collections of hundreds of terabytes of short-read sequencing experiments.
Carnegie Mellon University researchers introduce an indexing scheme called split sequence Bloom trees (SSBTs) to support sequence-based querying of terabyte-scale collections of thousands of short-read sequencing experiments. The SSBT is an improvement over the sequence Bloom tree (SBT) data structure for the same task. They apply SSBTs to the problem of finding conditions under which query transcripts are expressed. Their experiments are conducted on a set of 2652 publicly available RNA-seq experiments from breast, blood, and brain tissues. The researchers demonstrate that this SSBT index can be queried for a 1000 nt sequence in <4 minutes using a single thread and can be stored in just 39 GB, a fivefold improvement in search and storage costs compared with SBT.
An example of an uncompressed and a compressed SSBT, where black corresponds to a bit value of ‘1’ and white corresponds to a bit value of ‘0’.
(a) Grey bits correspond to non-informative bits whose value is known given a parent filter. Grey bits are cumulative: they exist at all index positions below a ‘1’ in the sim filter or a ‘0’ in the dif filter. When looking up index value 6, each filter is queried until either a sim ‘1’ or a dif ‘0’ is found. (b) All non-informative bits have been removed from the uncompressed tree. The lookup for index value 6 is adjusted to account for the removed non-informative bits.
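The lookup rule described for panel (a) can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes a binary tree in which each node's sim filter marks bits set in every experiment below it and its dif filter marks bits set in some but not all of them. The names `Node`, `build`, and `lookup`, and the toy 8-bit filters, are hypothetical. The sketch covers only the uncompressed tree and does not model the removal of non-informative bits shown in panel (b).

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    names: List[str]   # experiments (leaves) covered by this node
    sim: List[int]     # 1 = bit is set in every leaf below this node
    dif: List[int]     # 1 = bit is set in some, but not all, leaves below
    children: List["Node"] = field(default_factory=list)


def build(leaves: List[Tuple[str, List[int]]]) -> Node:
    """Build an uncompressed sim/dif tree over (name, bit-vector) leaves.

    Assumption: a simple balanced binary split; the real index may group
    experiments differently.
    """
    vectors = [bits for _, bits in leaves]
    width = len(vectors[0])
    sim = [int(all(v[i] for v in vectors)) for i in range(width)]
    # any - all is 1 only when the bit is set somewhere but not everywhere
    dif = [int(any(v[i] for v in vectors)) - sim[i] for i in range(width)]
    node = Node([name for name, _ in leaves], sim, dif)
    if len(leaves) > 1:
        mid = len(leaves) // 2
        node.children = [build(leaves[:mid]), build(leaves[mid:])]
    return node


def lookup(node: Node, index: int) -> List[str]:
    """Return the experiments whose filter has the bit at `index` set."""
    if node.sim[index]:        # sim '1': every leaf below has the bit
        return list(node.names)
    if not node.dif[index]:    # dif '0': no leaf below has the bit
        return []
    hits: List[str] = []       # undecided here, so descend
    for child in node.children:
        hits.extend(lookup(child, index))
    return hits


# Toy filters, 8 bits wide (made-up data for illustration only).
experiments = [
    ("A", [1, 0, 1, 0, 0, 0, 1, 0]),
    ("B", [1, 0, 0, 0, 0, 0, 1, 0]),
    ("C", [1, 1, 0, 0, 0, 0, 0, 0]),
    ("D", [1, 0, 0, 0, 0, 0, 1, 0]),
]
root = build(experiments)
print(lookup(root, 6))
```

In this toy tree, index 0 is answered at the root by a sim ‘1’ and index 3 by a dif ‘0’, so neither lookup needs to visit a child. Index 6 is undecided at the root and is resolved further down the tree.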