TetRex: a novel algorithm for index-accelerated search of highly conserved motifs

Bibliographic Details
Published in: NAR Genomics and Bioinformatics, Vol. 7; no. 2; p. lqaf039
Main Authors: Schwab, Remy M; Gottlieb, Simon Gene; Reinert, Knut
Format: Journal Article
Language: English
Published: England, Oxford University Press, 01.06.2025
ISSN: 2631-9268
DOI: 10.1093/nargab/lqaf039

Summary: The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Interestingly, we find a need for these innovations within the field of biology. Despite the vast differences among living organisms, there exist functions so critical to life that they are conserved in the genomes and proteomes across seemingly unrelated species. Regular expressions (regexes) can serve as a convenient way to represent these conserved sequences compactly. However, despite the strong theoretical foundation and maturity of tools available, state-of-the-art regex search falls short of what is necessary to quickly scan an enormous database of biological sequences. In this work, we describe a novel algorithm for regex search that reduces the search space by leveraging a recently developed compact data structure for set membership, the hierarchical interleaved Bloom filter. We show that the runtime of our method combined with a traditional search outperforms state-of-the-art tools.
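
The general idea in the summary, using a set-membership index to shrink the search space before running an ordinary regex search, can be illustrated with a small sketch. This is not the TetRex algorithm and it does not implement a hierarchical interleaved Bloom filter; it assumes one plain Bloom filter per sequence, a fixed-length motif given as a list of allowed residues per position, and hypothetical names (BloomFilter, build_index, candidate_bins, search) that do not come from the paper.

```python
import hashlib
import re
from itertools import product


class BloomFilter:
    """Plain Bloom filter (assumption: a stand-in for one bin of a real index)."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = bytearray(size)

    def _positions(self, item):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = 1

    def __contains__(self, item):
        # May report false positives, never false negatives.
        return all(self.bits[pos] for pos in self._positions(item))


def kmers(text, k):
    """All overlapping substrings of length k."""
    return [text[i:i + k] for i in range(len(text) - k + 1)]


def build_index(sequences, k):
    """One Bloom filter per named sequence, filled with its k-mers."""
    index = {}
    for name, seq in sequences.items():
        bf = BloomFilter()
        for kmer in kmers(seq, k):
            bf.add(kmer)
        index[name] = bf
    return index


def candidate_bins(index, motif, k):
    """Keep sequences whose filter contains every k-mer of some motif expansion.

    motif is a list of strings; each string holds the residues allowed at
    that position, e.g. ["C", "AG", "C"] stands for the regex C[AG]C.
    """
    expansions = ["".join(p) for p in product(*motif)]
    return [
        name
        for name, bf in index.items()
        if any(all(km in bf for km in kmers(word, k)) for word in expansions)
    ]


def search(sequences, index, motif, k):
    """Verify the surviving candidates with an ordinary regex search."""
    # Assumes residues are plain letters with no regex metacharacters.
    pattern = re.compile("".join(f"[{chars}]" for chars in motif))
    return [
        name
        for name in candidate_bins(index, motif, k)
        if pattern.search(sequences[name])
    ]


sequences = {"s1": "MKTAYIAKQR", "s2": "GGGCACGGG", "s3": "TTTTTTTT"}
index = build_index(sequences, k=3)
matches = search(sequences, index, ["C", "AG", "C"], k=3)
```

In this sketch the filter step can only discard sequences that cannot contain the motif, so the final regex pass removes any Bloom filter false positives and the reported matches are the same as a full scan would give. Enumerating every expansion of the motif grows quickly with the number of alternatives per position, which is one reason a real tool would need a more careful strategy.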