University of Kentucky researchers have developed SeqOthello, an ultra-fast and memory-efficient indexing structure that supports arbitrary sequence queries against large collections of RNA-seq experiments. SeqOthello needs only 5 min and 19.1 GB of memory to conduct a global survey of 11,658 fusion events against 10,113 TCGA Pan-Cancer RNA-seq datasets. The query recovers 92.7% of the tier-1 fusions curated by the TCGA Fusion Gene Database and reveals 270 novel occurrences, all of which are tumor-specific. By providing a reference-free, alignment-free, and parameter-free sequence search system, SeqOthello will enable large-scale integrative studies using sequence-level data, an undertaking not previously practicable for many individual labs.
Overview of SeqOthello structure and query procedure
(a) An illustration of the SeqOthello indexing structure, which supports scalable k-mer searching in large-scale sequencing experiments. The bottom level of SeqOthello stores the occurrence maps of individual k-mers, encoded in three different formats and divided into disjoint buckets. The mapping between a k-mer and its occurrence map is achieved by a hierarchy of Othello structures in which the root Othello maps a k-mer to its bucket and the Othello in each bucket maps a k-mer to its occurrence map. (b) An example illustrating SeqOthello’s sequence query process and output. A sequence query is decomposed into its constituent k-mers. The query result can be either a k-mer hit map, recording each k-mer’s presence/absence along the query sequence, or k-mer hit ratios (i.e., the fraction of query k-mers present in each experiment).
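The query procedure in panel (b) can be sketched in a few lines of Python. This is only an illustration of the idea, not SeqOthello's implementation: a plain dictionary stands in for the Othello hierarchy and the compressed occurrence maps, and the function names (`kmers`, `build_index`, `hit_map`, `hit_ratios`), the toy sequences, and the choice of k = 4 are all assumptions made for the example.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """Decompose a sequence into its overlapping k-mers, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(experiments: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each k-mer to the set of experiments containing it.

    A dict is a stand-in (assumption) for the Othello hierarchy; the set of
    experiment names plays the role of a k-mer's occurrence map.
    """
    index: Dict[str, Set[str]] = {}
    for name, sequence in experiments.items():
        for km in kmers(sequence, k):
            index.setdefault(km, set()).add(name)
    return index


def hit_map(query: str, index: Dict[str, Set[str]], experiment: str, k: int) -> List[bool]:
    """Presence/absence of each query k-mer, along the query, in one experiment."""
    return [experiment in index.get(km, set()) for km in kmers(query, k)]


def hit_ratios(query: str, index: Dict[str, Set[str]], names: List[str], k: int) -> Dict[str, float]:
    """Fraction of query k-mers present in each experiment."""
    total = len(kmers(query, k))
    if total == 0:  # query shorter than k: no k-mers to look up
        return {name: 0.0 for name in names}
    return {name: sum(hit_map(query, index, name, k)) / total for name in names}


# Toy data (hypothetical): two "experiments" reduced to one short sequence each.
K = 4
EXPERIMENTS = {"exp1": "ACGTACGT", "exp2": "ACGTTTTT"}
INDEX = build_index(EXPERIMENTS, K)
QUERY = "ACGTAC"  # k-mers: ACGT, CGTA, GTAC

print(hit_map(QUERY, INDEX, "exp2", K))
print(hit_ratios(QUERY, INDEX, list(EXPERIMENTS), K))
```

In this toy case all three query k-mers occur in `exp1`, while only the first occurs in `exp2`, so the hit ratios separate the two experiments. The real structure differs mainly in how the k-mer-to-occurrence-map lookup is stored, which is where the memory savings described above come from.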
Availability – The source code of SeqOthello is available on GitHub at <https://github.com/LiuBioinfo/SeqOthello>.