TetRex: a novel algorithm for index-accelerated search of highly conserved motifs

The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Int...

Bibliographic Details
Published in	NAR genomics and bioinformatics Vol. 7; no. 2; p. lqaf039
Main Authors	Schwab, Remy M; Gottlieb, Simon Gene; Reinert, Knut
Format	Journal Article
Language	English
Published	England: Oxford University Press, 01.06.2025
Subjects	Algorithms; Amino Acid Motifs; Computational Biology - methods; Conserved Sequence; Editor's Choice; Methods; Software
ISSN	2631-9268
DOI	10.1093/nargab/lqaf039

Abstract	The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Interestingly, we find a need for these innovations within the field of biology. Despite the vast differences among living organisms, there exist functions so critical to life that they are conserved in the genomes and proteomes across seemingly unrelated species. Regular expressions (regexes) can serve as a convenient way to represent these conserved sequences compactly. However, despite the strong theoretical foundation and maturity of tools available, the state-of-the-art regex search falls short of what is necessary to quickly scan an enormous database of biological sequences. In this work, we describe a novel algorithm for regex search that reduces the search space by leveraging a recently developed compact data structure for set membership, the hierarchical interleaved bloom filter. We show that the runtime of our method combined with a traditional search outperforms state-of-the-art tools.
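The filter-then-verify idea named in the abstract can be illustrated with a minimal sketch. This is not the TetRex implementation: a plain Bloom filter per bin stands in for the hierarchical interleaved Bloom filter, the names `BloomFilter`, `build_index`, and `search` are hypothetical, and the k-mers that every match must contain are supplied by hand rather than derived from the regex.

```python
import re
from hashlib import blake2b


class BloomFilter:
    """Tiny Bloom filter; an assumed stand-in for one bin of the real index."""

    def __init__(self, n_bits=4096, n_hashes=3):
        self.n_bits = n_bits
        self.n_hashes = n_hashes
        self.bits = bytearray(n_bits // 8)

    def _positions(self, item):
        # One salted hash per hash function, mapped to a bit position.
        for seed in range(self.n_hashes):
            digest = blake2b(item.encode(), digest_size=8, salt=bytes([seed])).digest()
            yield int.from_bytes(digest, "big") % self.n_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item):
        # May return a false positive, never a false negative.
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(item))


def kmers(seq, k):
    """All distinct substrings of length k."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def build_index(bins, k):
    """One Bloom filter per bin, holding every k-mer of the bin's sequences."""
    index = []
    for seqs in bins:
        bf = BloomFilter()
        for seq in seqs:
            for kmer in kmers(seq, k):
                bf.add(kmer)
        index.append(bf)
    return index


def search(bins, index, pattern, required_kmers):
    """Scan with the regex only those bins whose filter holds all required k-mers."""
    regex = re.compile(pattern)
    hits = []
    for bin_id, (seqs, bf) in enumerate(zip(bins, index)):
        if not all(kmer in bf for kmer in required_kmers):
            continue  # a required k-mer is certainly absent, so no sequence here can match
        hits.extend((bin_id, seq) for seq in seqs if regex.search(seq))
    return hits


# Toy data: three bins of short protein-like strings (made up for illustration).
bins = [["MKACDLLGHKP", "TTTTTT"], ["QQQQWWWW"], ["ACDGHK", "GHKACD"]]
index = build_index(bins, k=3)
# Every match of ACD.+GHK must contain the 3-mers ACD and GHK.
hits = search(bins, index, r"ACD.+GHK", ["ACD", "GHK"])
```

In this toy run the third bin passes the filter but yields no hit, which shows why the traditional regex scan is still needed as a verification step after the index has reduced the search space.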
Copyright	The Author(s) 2025. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
EISSN	2631-9268
IsDoiOpenAccess	true
IsOpenAccess	true
IsPeerReviewed	true
IsScholarly	true
License	https://creativecommons.org/licenses/by/4.0 The Author(s) 2025. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
ORCID	0000-0003-3078-8129 (Reinert, Knut)
OpenAccessLink	http://dx.doi.org/10.1093/nargab/lqaf039
PMID	40248489
PublicationTitleAlternate	NAR Genom Bioinform
URI	https://www.ncbi.nlm.nih.gov/pubmed/40248489 https://www.proquest.com/docview/3191399127 https://pubmed.ncbi.nlm.nih.gov/PMC12004226