MONI: A Pangenomic Index for Finding Maximal Exact Matches
Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact...
Published in | Journal of computational biology Vol. 29; no. 2; p. 169 |
---|---|
Main Authors | Rossi, Massimiliano; Oliva, Marco; Langmead, Ben; Gagie, Travis; Boucher, Christina |
Format | Journal Article |
Language | English |
Published | United States, 01.02.2022 |
Subjects | |
ISSN | 1557-8666 |
DOI | 10.1089/cmb.2021.0290 |
Abstract | Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in time and space linear in the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references. |
---|---|
Author | Rossi, Massimiliano; Gagie, Travis; Langmead, Ben; Boucher, Christina; Oliva, Marco |
Author_xml | – sequence: 1 givenname: Massimiliano orcidid: 0000-0002-3012-1394 surname: Rossi fullname: Rossi, Massimiliano organization: Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA – sequence: 2 givenname: Marco orcidid: 0000-0003-0525-3114 surname: Oliva fullname: Oliva, Marco organization: Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA – sequence: 3 givenname: Ben orcidid: 0000-0003-2437-1976 surname: Langmead fullname: Langmead, Ben organization: Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA – sequence: 4 givenname: Travis orcidid: 0000-0003-3689-327X surname: Gagie fullname: Gagie, Travis organization: Faculty of Computer Science, Dalhousie University, Halifax, Canada – sequence: 5 givenname: Christina orcidid: 0000-0001-9509-9725 surname: Boucher fullname: Boucher, Christina organization: Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/35041495$$D View this record in MEDLINE/PubMed |
CitedBy_id | crossref_primary_10_1146_annurev_genom_021623_081639 crossref_primary_10_1093_bioinformatics_btae717 crossref_primary_10_1016_j_isci_2024_111464 crossref_primary_10_1093_bioinformatics_btae213 crossref_primary_10_1038_s41467_024_55762_1 crossref_primary_10_1093_bioinformatics_btad552 crossref_primary_10_1145_3701561 crossref_primary_10_1186_s13059_023_02958_1 crossref_primary_10_1089_cmb_2021_0445 crossref_primary_10_1016_j_ic_2024_105153 crossref_primary_10_1016_j_ic_2024_105155 crossref_primary_10_1186_s13059_023_02969_y crossref_primary_10_1371_journal_pcbi_1012665 crossref_primary_10_1093_bioinformatics_btad460 crossref_primary_10_1186_s13015_023_00225_3 crossref_primary_10_1186_s13015_025_00272_y crossref_primary_10_1093_bioadv_vbae113 crossref_primary_10_1101_gr_279143_124 crossref_primary_10_1016_j_isci_2024_110933 |
ContentType | Journal Article |
DBID | CGR CUY CVF ECM EIF NPM |
DOI | 10.1089/cmb.2021.0290 |
DatabaseName | Medline MEDLINE MEDLINE (Ovid) MEDLINE MEDLINE PubMed |
DatabaseTitle | MEDLINE Medline Complete MEDLINE with Full Text PubMed MEDLINE (Ovid) |
DatabaseTitleList | MEDLINE |
Database_xml | – sequence: 1 dbid: NPM name: PubMed url: https://proxy.k.utb.cz/login?url=http://www.ncbi.nlm.nih.gov/entrez/query.fcgi?db=PubMed sourceTypes: Index Database – sequence: 2 dbid: EIF name: MEDLINE url: https://proxy.k.utb.cz/login?url=https://www.webofscience.com/wos/medline/basic-search sourceTypes: Index Database |
DeliveryMethod | no_fulltext_linktorsrc |
Discipline | Biology Mathematics |
EISSN | 1557-8666 |
ExternalDocumentID | 35041495 |
Genre | Research Support, U.S. Gov't, Non-P.H.S.; Research Support, Non-U.S. Gov't; Journal Article; Research Support, N.I.H., Extramural |
GrantInformation_xml | – fundername: NIAID NIH HHS grantid: R01 AI141810 – fundername: NHGRI NIH HHS grantid: R01 HG011392 |
ID | FETCH-LOGICAL-c453t-5558eeb5d7d8f576fdf9542a7b8554044726f69c4b43402423cab211896fd7032 |
IngestDate | Thu Apr 03 06:56:30 EDT 2025 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 2 |
Keywords | thresholds; run-length-encoded Burrows-Wheeler transform; MEM-finding; r-index |
Language | English |
LinkModel | OpenURL |
MergedId | FETCHMERGED-LOGICAL-c453t-5558eeb5d7d8f576fdf9542a7b8554044726f69c4b43402423cab211896fd7032 |
ORCID | 0000-0002-3012-1394 0000-0003-3689-327X 0000-0003-0525-3114 0000-0003-2437-1976 0000-0001-9509-9725 |
OpenAccessLink | https://www.ncbi.nlm.nih.gov/pmc/articles/8892979 |
PMID | 35041495 |
ParticipantIDs | pubmed_primary_35041495 |
PublicationCentury | 2000 |
PublicationDate | 2022-02-00 |
PublicationDateYYYYMMDD | 2022-02-01 |
PublicationDate_xml | – month: 02 year: 2022 text: 2022-02-00 |
PublicationDecade | 2020 |
PublicationPlace | United States |
PublicationPlace_xml | – name: United States |
PublicationTitle | Journal of computational biology |
PublicationTitleAlternate | J Comput Biol |
PublicationYear | 2022 |
SSID | ssj0013607 |
Score | 2.4997575 |
SourceID | pubmed |
SourceType | Index Database |
StartPage | 169 |
SubjectTerms | Algorithms; Computational Biology; Databases, Genetic - statistics & numerical data; Genome, Bacterial; Genome, Human; Genomics - statistics & numerical data; High-Throughput Nucleotide Sequencing - statistics & numerical data; Humans; Salmonella - genetics; Sequence Alignment - statistics & numerical data; Sequence Analysis, DNA - statistics & numerical data; Software; Wavelet Analysis |
Title | MONI: A Pangenomic Index for Finding Maximal Exact Matches |
URI | https://www.ncbi.nlm.nih.gov/pubmed/35041495 |
Volume | 29 |
hasFullText | |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | National Library of Medicine |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=MONI%3A+A+Pangenomic+Index+for+Finding+Maximal+Exact+Matches&rft.jtitle=Journal+of+computational+biology&rft.au=Rossi%2C+Massimiliano&rft.au=Oliva%2C+Marco&rft.au=Langmead%2C+Ben&rft.au=Gagie%2C+Travis&rft.date=2022-02-01&rft.eissn=1557-8666&rft.volume=29&rft.issue=2&rft.spage=169&rft_id=info:doi/10.1089%2Fcmb.2021.0290&rft_id=info%3Apmid%2F35041495&rft_id=info%3Apmid%2F35041495&rft.externalDocID=35041495 |
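
The two notions the abstract relies on, the run-length-encoded Burrows-Wheeler transform behind the r-index and maximal exact matches (MEMs), can be illustrated with a small Python sketch. This is not MONI's algorithm: it uses neither prefix-free parsing nor thresholds, it runs in quadratic time or worse, and the function names (`bwt`, `count_runs`, `naive_mems`) are hypothetical names chosen for illustration. It only shows what is being computed.

```python
def bwt(text: str) -> str:
    """Burrows-Wheeler transform via sorted rotations (illustration only)."""
    s = text + "$"  # '$' is a sentinel that sorts before letters in ASCII
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def count_runs(s: str) -> int:
    """Number of maximal runs of equal characters (the r in 'r-index')."""
    return sum(1 for i, c in enumerate(s) if i == 0 or c != s[i - 1])


def naive_mems(pattern: str, text: str, min_len: int = 1):
    """Return (start_in_pattern, length) for each MEM of pattern against text.

    Brute force: ms[i] is the length of the longest prefix of pattern[i:]
    that occurs somewhere in text (the matching statistics).
    """
    ms = []
    for i in range(len(pattern)):
        length = 0
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)

    mems = []
    for i, length in enumerate(ms):
        # The match at i is right-maximal by construction. It is also
        # left-maximal unless the match starting at i - 1 already covers it,
        # which happens exactly when ms[i - 1] == ms[i] + 1.
        if length >= min_len and (i == 0 or ms[i - 1] <= length):
            mems.append((i, length))
    return mems


# A repetitive text yields a BWT with few runs, which is what the r-index exploits.
print(bwt("banana"))                # annb$aa
print(count_runs(bwt("ACGT" * 4)))  # 5 runs for a BWT of 17 characters
print(naive_mems("CGTTACG", "ACGTACGT"))  # [(0, 3), (3, 4)]
```

In the last call, the pattern `CGTTACG` has two MEMs against `ACGTACGT`: `CGT` starting at pattern position 0 and `TACG` starting at position 3. According to the abstract, an index such as MONI answers the same kind of query without scanning the text, by storing thresholds alongside the r-index.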