Creating and Using Minimizer Sketches in Computational Genomics


Bibliographic Details
Published in: Journal of computational biology, Vol. 30, no. 12, pp. 1251-1276
Main Authors: Zheng, Hongyu; Marçais, Guillaume; Kingsford, Carl
Format: Journal Article
Language: English
Published: United States, Mary Ann Liebert, Inc., publishers, 01.12.2023
ISSN: 1557-8666
DOI: 10.1089/cmb.2023.0094

Abstract: Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
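For readers unfamiliar with the term, a minimizer sketch in its classical form selects, from every window of w consecutive k-mers, the smallest k-mer under some fixed ordering. The following is a minimal illustrative sketch only, not code from the survey: the function name `minimizer_sketch`, the default lexicographic ordering, and the leftmost tie-breaking rule are all assumptions chosen for brevity.

```python
def minimizer_sketch(seq, k, w, order=None):
    """Return the (position, k-mer) pairs selected as window minimizers.

    Assumptions (illustrative, not taken from the survey):
    - `order` maps a k-mer to a sortable key; default is lexicographic.
    - Ties inside a window are broken by choosing the leftmost k-mer.
    """
    if order is None:
        order = lambda kmer: kmer  # plain lexicographic ordering

    n_kmers = len(seq) - k + 1
    # Precompute the ordering key of every k-mer in the sequence.
    keys = [order(seq[i:i + k]) for i in range(max(n_kmers, 0))]

    selected = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        # Smallest key wins; the index breaks ties toward the left.
        best = min(range(start, start + w), key=lambda i: (keys[i], i))
        selected.add(best)

    return [(p, seq[p:p + k]) for p in sorted(selected)]


if __name__ == "__main__":
    for pos, kmer in minimizer_sketch("ACGTACGT", k=3, w=3):
        print(pos, kmer)
```

This brute-force version does O(w) work per window and is meant only to show the selection rule; adjacent windows often share the same minimizer, which is why the sketch is smaller than the full k-mer list. Other orderings, such as a hash of the k-mer, can be passed through the hypothetical `order` parameter.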
Author affiliations:
Zheng, Hongyu (ORCID 0000-0002-7668-2090): Computer Science Department, Princeton University, Princeton, New Jersey, USA
Marçais, Guillaume: Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Kingsford, Carl: Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
Copyright Hongyu Zheng, et al., 2023; Published by Mary Ann Liebert, Inc.
Genre: Review; Journal Article; Research Support, N.I.H., Extramural; Research Support, U.S. Gov't, Non-P.H.S
Grant information: NHGRI NIH HHS, grant R01 HG012470
Keywords: minimizers; sketching; read mapping; de Bruijn graphs; k-mer counting
License: This Open Access article is distributed under the terms of the Creative Commons License (http://creativecommons.org/licenses/by/4.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
Open Access link: https://www.liebertpub.com/doi/abs/10.1089/cmb.2023.0094
PMID: 37646787
Subject terms: Algorithms; Genomics - methods; High-Throughput Nucleotide Sequencing - methods; Review Article; Sequence Analysis, DNA - methods; Software