TetRex: a novel algorithm for index-accelerated search of highly conserved motifs
Published in | NAR genomics and bioinformatics Vol. 7; no. 2; p. lqaf039 |
---|---|
Main Authors | Schwab, Remy M; Gottlieb, Simon Gene; Reinert, Knut |
Format | Journal Article |
Language | English |
Published | England: Oxford University Press, 01.06.2025 |
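Subjects | Algorithms; Amino Acid Motifs; Computational Biology - methods; Conserved Sequence; Editor's Choice; Methods; Software |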
ISSN | 2631-9268 |
DOI | 10.1093/nargab/lqaf039 |
Abstract | The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Interestingly, we find a need for these innovations within the field of biology. Despite the vast differences among living organisms, there exist functions so critical to life that they are conserved in the genomes and proteomes across seemingly unrelated species. Regular expressions (regexes) can serve as a convenient way to represent these conserved sequences compactly. However, despite the strong theoretical foundation and maturity of tools available, the state-of-the-art regex search falls short of what is necessary to quickly scan an enormous database of biological sequences. In this work, we describe a novel algorithm for regex search that reduces the search space by leveraging a recently developed compact data structure for set membership, the hierarchical interleaved Bloom filter. We show that the runtime of our method combined with a traditional search outperforms state-of-the-art tools. |
---|---|
Author | Gottlieb, Simon Gene; Reinert, Knut; Schwab, Remy M |
Author_xml | 1. Schwab, Remy M; 2. Gottlieb, Simon Gene; 3. Reinert, Knut (ORCID: 0000-0003-3078-8129) |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/40248489 (MEDLINE/PubMed record) |
Copyright | The Author(s) 2025. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. |
ExternalDocumentID | PMC12004226 40248489 10_1093_nargab_lqaf039 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 2 |
License | https://creativecommons.org/licenses/by/4.0 The Author(s) 2025. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (https://creativecommons.org/licenses/by/4.0/), which permits unrestricted reuse, distribution, and reproduction in any medium, provided the original work is properly cited. |
ORCID | 0000-0003-3078-8129 |
OpenAccessLink | http://dx.doi.org/10.1093/nargab/lqaf039 |
PMID | 40248489 |
PublicationDate | 2025-06-01 |
PublicationPlace | England |
PublicationTitle | NAR genomics and bioinformatics |
PublicationTitleAlternate | NAR Genom Bioinform |
PublicationYear | 2025 |
Publisher | Oxford University Press |
StartPage | lqaf039 |
SubjectTerms | Algorithms; Amino Acid Motifs; Computational Biology - methods; Conserved Sequence; Editor's Choice; Methods; Software |
URI | https://www.ncbi.nlm.nih.gov/pubmed/40248489 https://www.proquest.com/docview/3191399127 https://pubmed.ncbi.nlm.nih.gov/PMC12004226 |
Volume | 7 |