MONI: A Pangenomic Index for Finding Maximal Exact Matches

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact...

Bibliographic Details
Published in	Journal of computational biology Vol. 29; no. 2; p. 169
Main Authors	Rossi, Massimiliano; Oliva, Marco; Langmead, Ben; Gagie, Travis; Boucher, Christina
Format	Journal Article
Language	English
Published	United States 01.02.2022
Subjects	Algorithms; Computational Biology; Databases, Genetic - statistics & numerical data; Genome, Bacterial; Genome, Human; Genomics - statistics & numerical data; High-Throughput Nucleotide Sequencing - statistics & numerical data; Humans; Salmonella - genetics; Sequence Alignment - statistics & numerical data; Sequence Analysis, DNA - statistics & numerical data; Software; Wavelet Analysis; thresholds; run-length-encoded Burrows-Wheeler transform; MEM-finding; r-index
ISSN	1557-8666
DOI	10.1089/cmb.2021.0290


Abstract	Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in linear time and space with respect to the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2-11 times less memory and was 2-32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.
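Note	For readers unfamiliar with the query named in the abstract, the following is a minimal brute-force sketch of what a maximal exact match (MEM) is. It is not MONI's algorithm: it uses no r-index, no thresholds, and no prefix-free parsing, and it runs in time far worse than linear. The function names `matching_statistics` and `find_mems` and the `min_len` parameter are hypothetical, chosen only for illustration.

```python
def matching_statistics(text, pattern):
    """For each start i in pattern, length of the longest prefix of
    pattern[i:] that occurs somewhere in text (brute force)."""
    ms = []
    for i in range(len(pattern)):
        length = 0
        # Extend the match one character at a time while it still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        ms.append(length)
    return ms


def find_mems(text, pattern, min_len=1):
    """Return (start, length) pairs of maximal exact matches of pattern in text.

    A match starting at i is right-maximal by construction of ms[i]. It is
    left-maximal when i == 0 or when the match starting at i - 1 does not
    already cover it, which holds exactly when ms[i - 1] <= ms[i].
    """
    ms = matching_statistics(text, pattern)
    mems = []
    for i, length in enumerate(ms):
        left_maximal = i == 0 or ms[i - 1] <= length
        if length >= min_len and left_maximal:
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    reference = "GATTACAGATTA"   # toy stand-in for a reference collection
    read = "TTACGATT"            # toy stand-in for a sequencing read
    for start, length in find_mems(reference, read):
        print(start, length, read[start:start + length])
```

On the toy input, the read splits into two MEMs, `TTAC` and `GATT`, because `TTACG` and `CGATT` do not occur in the reference. An index such as the one described in the abstract answers the same question without scanning the reference for every extension.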
Author affiliations	1. Rossi, Massimiliano (ORCID 0000-0002-3012-1394), Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA; 2. Oliva, Marco (ORCID 0000-0003-0525-3114), Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA; 3. Langmead, Ben (ORCID 0000-0003-2437-1976), Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA; 4. Gagie, Travis (ORCID 0000-0003-3689-327X), Faculty of Computer Science, Dalhousie University, Halifax, Canada; 5. Boucher, Christina (ORCID 0000-0001-9509-9725), Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
Discipline	Biology Mathematics
Genre	Research Support, U.S. Gov't, Non-P.H.S.; Research Support, Non-U.S. Gov't; Journal Article; Research Support, N.I.H., Extramural
Grant Information	NIAID NIH HHS, grant R01 AI141810; NHGRI NIH HHS, grant R01 HG011392
OpenAccessLink	https://www.ncbi.nlm.nih.gov/pmc/articles/8892979
PMID	35041495