TetRex: a novel algorithm for index-accelerated search of highly conserved motifs

Bibliographic Details
Published in NAR genomics and bioinformatics Vol. 7; no. 2; p. lqaf039
Main Authors Schwab, Remy M; Gottlieb, Simon Gene; Reinert, Knut
Format Journal Article
Language English
Published England: Oxford University Press, 01.06.2025
ISSN 2631-9268
DOI 10.1093/nargab/lqaf039


Abstract The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Interestingly, we find a need for these innovations within the field of biology. Despite the vast differences among living organisms, there exist functions so critical to life that they are conserved in the genomes and proteomes across seemingly unrelated species. Regular expressions (regexes) can serve as a convenient way to represent these conserved sequences compactly. However, despite the strong theoretical foundation and maturity of the tools available, state-of-the-art regex search falls short of what is necessary to quickly scan an enormous database of biological sequences. In this work, we describe a novel algorithm for regex search that reduces the search space by leveraging a recently developed compact data structure for set membership, the hierarchical interleaved Bloom filter. We show that the runtime of our method combined with a traditional search outperforms state-of-the-art tools.
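The abstract's central idea, using a set-membership structure to shrink the search space before running a conventional regex search, can be illustrated with a toy sketch. This is not the TetRex algorithm, whose details are not given in this record. It rests on several assumptions: sequences are grouped into named bins, a plain Python set of k-mers stands in for each bin's Bloom filter, and the literal that every match must contain is supplied by hand. The names `kmers`, `build_index`, `candidate_bins`, and `search` are hypothetical.

```python
import re

K = 3  # k-mer length; an illustrative choice, not taken from the paper


def kmers(text, k=K):
    """Return the set of all length-k substrings of text."""
    return {text[i:i + k] for i in range(len(text) - k + 1)}


def build_index(bins):
    """Map each bin id to the set of k-mers found in its sequences.

    A plain set stands in for a Bloom filter: it answers the same
    membership question, only without false positives.
    """
    return {bin_id: set().union(*(kmers(s) for s in seqs))
            for bin_id, seqs in bins.items()}


def candidate_bins(index, required_literal):
    """Return bins containing every k-mer of a literal the regex requires."""
    needed = kmers(required_literal)
    # If the literal is shorter than K, `needed` is empty and every bin
    # is kept, which is the safe (conservative) outcome.
    return [b for b, present in index.items() if needed <= present]


def search(bins, index, pattern, required_literal):
    """Run the regex only on sequences from bins that pass the prefilter."""
    regex = re.compile(pattern)
    hits = []
    for b in candidate_bins(index, required_literal):
        for seq in bins[b]:
            if regex.search(seq):
                hits.append((b, seq))
    return hits


# Made-up protein-like sequences, for illustration only.
bins = {
    "bin0": ["MKTAYIAKQR", "GGSGGSGGS"],
    "bin1": ["AACDEHKLMN", "QWCDEHRTY"],
    "bin2": ["PPPPPPPP"],
}
index = build_index(bins)
hits = search(bins, index, r"CDEH[KR]", "CDEH")
```

In this toy run only `bin1` passes the prefilter, so the regex engine never touches the other bins. A real Bloom filter would occasionally admit a bin that does not contain the k-mers, which costs time but not correctness, because the final regex pass still verifies every hit.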
License https://creativecommons.org/licenses/by/4.0
Copyright The Author(s) 2025. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited.
ORCID Reinert, Knut: 0000-0003-3078-8129
PMID 40248489
Subjects Algorithms; Amino Acid Motifs; Computational Biology - methods; Conserved Sequence; Editor's Choice; Methods; Software
URI https://www.ncbi.nlm.nih.gov/pubmed/40248489
https://www.proquest.com/docview/3191399127
https://pubmed.ncbi.nlm.nih.gov/PMC12004226