MONI: A Pangenomic Index for Finding Maximal Exact Matches

Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Then Kuhnle et al. showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact...

Bibliographic Details
Published in Journal of computational biology Vol. 29; no. 2; p. 169
Main Authors Rossi, Massimiliano; Oliva, Marco; Langmead, Ben; Gagie, Travis; Boucher, Christina
Format Journal Article
Language English
Published United States 01.02.2022
ISSN 1557-8666
DOI 10.1089/cmb.2021.0290

Abstract Recently, Gagie et al. proposed a version of the FM-index, called the r-index, that can store thousands of human genomes on a commodity computer. Kuhnle et al. then showed how to build the r-index efficiently via a technique called prefix-free parsing (PFP) and demonstrated its effectiveness for exact pattern matching. Exact pattern matching can be leveraged to support approximate pattern matching, but the r-index itself cannot efficiently support popular and important queries such as finding maximal exact matches (MEMs). To address this shortcoming, Bannai et al. introduced the concept of thresholds and showed that storing them together with the r-index enables efficient MEM finding, but they did not say how to find those thresholds. We present a novel algorithm that applies PFP to build the r-index and find the thresholds simultaneously, in time and space linear in the size of the prefix-free parse. Our implementation, called MONI, can rapidly find MEMs between reads and large collections of highly repetitive sequences. Compared with other read aligners (PuffAligner, Bowtie2, BWA-MEM, and CHIC), MONI used 2 to 11 times less memory and was 2 to 32 times faster for index construction. Moreover, MONI was less than one thousandth the size of competing indexes for large collections of human chromosomes. Thus, MONI represents a major advance in our ability to perform MEM finding against very large collections of related references.
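The abstract refers to maximal exact matches without defining them. As a rough illustration only, the sketch below computes MEMs by brute force: for each position of a pattern it finds the longest prefix of the remaining suffix that occurs in the text, then keeps the matches that cannot be extended to the left. This is a naive quadratic-time sketch of the definition, not MONI's r-index and thresholds algorithm; the function names `matching_lengths` and `maximal_exact_matches` and the `min_len` parameter are hypothetical and chosen for this example.

```python
def matching_lengths(pattern: str, text: str) -> list[int]:
    """For each i, length of the longest prefix of pattern[i:] occurring in text."""
    lengths = []
    for i in range(len(pattern)):
        length = 0
        # Extend to the right while the extended substring still occurs in text.
        while i + length < len(pattern) and pattern[i:i + length + 1] in text:
            length += 1
        lengths.append(length)
    return lengths


def maximal_exact_matches(pattern: str, text: str, min_len: int = 1) -> list[tuple[int, int]]:
    """Return (start, length) pairs of MEMs of pattern with respect to text.

    Each match is right-maximal by construction of matching_lengths.
    It is left-maximal when the previous position does not already
    yield a longer match covering it.
    """
    lengths = matching_lengths(pattern, text)
    mems = []
    for i, length in enumerate(lengths):
        left_maximal = i == 0 or lengths[i - 1] <= length
        if length >= min_len and left_maximal:
            mems.append((i, length))
    return mems


if __name__ == "__main__":
    # Toy strings, assumed for illustration only.
    for start, length in maximal_exact_matches("CGTTACG", "ACGTACGT"):
        print(start, length, "CGTTACG"[start:start + length])
```

On a pangenome-scale text the substring test in the inner loop would be far too slow, which is the gap that a compressed index with thresholds is meant to close.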
Author Rossi, Massimiliano
Gagie, Travis
Langmead, Ben
Boucher, Christina
Oliva, Marco
Author_xml – sequence: 1
  givenname: Massimiliano
  orcidid: 0000-0002-3012-1394
  surname: Rossi
  fullname: Rossi, Massimiliano
  organization: Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
– sequence: 2
  givenname: Marco
  orcidid: 0000-0003-0525-3114
  surname: Oliva
  fullname: Oliva, Marco
  organization: Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
– sequence: 3
  givenname: Ben
  orcidid: 0000-0003-2437-1976
  surname: Langmead
  fullname: Langmead, Ben
  organization: Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
– sequence: 4
  givenname: Travis
  orcidid: 0000-0003-3689-327X
  surname: Gagie
  fullname: Gagie, Travis
  organization: Faculty of Computer Science, Dalhousie University, Halifax, Canada
– sequence: 5
  givenname: Christina
  orcidid: 0000-0001-9509-9725
  surname: Boucher
  fullname: Boucher, Christina
  organization: Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
BackLink https://www.ncbi.nlm.nih.gov/pubmed/35041495$$D View this record in MEDLINE/PubMed
CitedBy_id crossref_primary_10_1146_annurev_genom_021623_081639
crossref_primary_10_1093_bioinformatics_btae717
crossref_primary_10_1016_j_isci_2024_111464
crossref_primary_10_1093_bioinformatics_btae213
crossref_primary_10_1038_s41467_024_55762_1
crossref_primary_10_1093_bioinformatics_btad552
crossref_primary_10_1145_3701561
crossref_primary_10_1186_s13059_023_02958_1
crossref_primary_10_1089_cmb_2021_0445
crossref_primary_10_1016_j_ic_2024_105153
crossref_primary_10_1016_j_ic_2024_105155
crossref_primary_10_1186_s13059_023_02969_y
crossref_primary_10_1371_journal_pcbi_1012665
crossref_primary_10_1093_bioinformatics_btad460
crossref_primary_10_1186_s13015_023_00225_3
crossref_primary_10_1186_s13015_025_00272_y
crossref_primary_10_1093_bioadv_vbae113
crossref_primary_10_1101_gr_279143_124
crossref_primary_10_1016_j_isci_2024_110933
ContentType Journal Article
DOI 10.1089/cmb.2021.0290
DatabaseName Medline
MEDLINE
MEDLINE (Ovid)
MEDLINE
MEDLINE
PubMed
DatabaseTitle MEDLINE
Medline Complete
MEDLINE with Full Text
PubMed
MEDLINE (Ovid)
DatabaseTitleList MEDLINE
Discipline Biology
Mathematics
EISSN 1557-8666
ExternalDocumentID 35041495
Genre Research Support, U.S. Gov't, Non-P.H.S
Research Support, Non-U.S. Gov't
Journal Article
Research Support, N.I.H., Extramural
GrantInformation_xml – fundername: NIAID NIH HHS
  grantid: R01 AI141810
– fundername: NHGRI NIH HHS
  grantid: R01 HG011392
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 2
Keywords thresholds
run-length-encoded Burrows-Wheeler transform
MEM-finding
r-index
Language English
LinkModel OpenURL
ORCID 0000-0002-3012-1394
0000-0003-3689-327X
0000-0003-0525-3114
0000-0003-2437-1976
0000-0001-9509-9725
OpenAccessLink https://www.ncbi.nlm.nih.gov/pmc/articles/8892979
PMID 35041495
ParticipantIDs pubmed_primary_35041495
PublicationCentury 2000
PublicationDate 2022-02-00
PublicationDateYYYYMMDD 2022-02-01
PublicationDate_xml – month: 02
  year: 2022
  text: 2022-02-00
PublicationDecade 2020
PublicationPlace United States
PublicationPlace_xml – name: United States
PublicationTitle Journal of computational biology
PublicationTitleAlternate J Comput Biol
PublicationYear 2022
SourceID pubmed
SourceType Index Database
StartPage 169
SubjectTerms Algorithms
Computational Biology
Databases, Genetic - statistics & numerical data
Genome, Bacterial
Genome, Human
Genomics - statistics & numerical data
High-Throughput Nucleotide Sequencing - statistics & numerical data
Humans
Salmonella - genetics
Sequence Alignment - statistics & numerical data
Sequence Analysis, DNA - statistics & numerical data
Software
Wavelet Analysis
Title MONI: A Pangenomic Index for Finding Maximal Exact Matches
URI https://www.ncbi.nlm.nih.gov/pubmed/35041495
Volume 29