Machine learning models for delineating marine microbial taxa

Bibliographic Details
Published in NAR Genomics and Bioinformatics Vol. 7; no. 2; p. lqaf090
Main Author Louca, Stilianos
Format Journal Article
Language English
Published England 01.06.2025
ISSN 2631-9268
DOI 10.1093/nargab/lqaf090

Abstract The relationship between gene content differences and microbial taxonomic divergence remains poorly understood, and algorithms for delineating novel microbial taxa above the genus level based on multiple genome similarity metrics are lacking. Addressing these gaps is important for macroevolutionary theory, biodiversity assessments, and the discovery of novel taxa in metagenomes. Here, I develop machine learning classifier models, based on multiple genome similarity metrics, to determine whether any two marine bacterial and archaeal (prokaryotic) metagenome-assembled genomes (MAGs) belong to the same taxon, at levels from genus up to phylum. Metrics include average amino acid and nucleotide identities, and fractions of shared genes within various categories, applied to 14 390 previously published non-redundant MAGs. At all taxonomic levels, the balanced accuracy (average of the true-positive and true-negative rates) of the classifiers exceeded 92%, suggesting that simple genome similarity metrics serve as good taxon differentiators. Predictor selection and sensitivity analyses revealed gene categories that are particularly correlated with taxon divergence, e.g. those involved in the metabolism of cofactors and vitamins. Predicted taxon delineations were further used to enumerate marine prokaryotic taxa de novo. Statistical analyses of those enumerations suggest that over half of extant marine prokaryotic phyla, classes, and orders have already been recovered by genome-resolved metagenomic surveys.
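The two quantities named in the abstract, a fraction of shared genes and the balanced accuracy, can be illustrated with a minimal sketch. The function names and toy inputs below are hypothetical, and the Jaccard-style ratio used for the shared-gene fraction is an assumption for illustration only; this record does not describe the author's actual implementation.

```python
def shared_gene_fraction(genes_a, genes_b):
    """Fraction of genes shared between two genomes.

    ASSUMPTION: computed here as a Jaccard-style ratio
    (intersection size / union size); the record itself does not
    state which definition the article uses.
    """
    a, b = set(genes_a), set(genes_b)
    union = a | b
    if not union:
        return 0.0  # two empty gene sets: define the fraction as 0
    return len(a & b) / len(union)


def balanced_accuracy(y_true, y_pred):
    """Average of the true-positive rate and the true-negative rate.

    y_true, y_pred: equal-length sequences of booleans (or 0/1), where
    True means "the two genomes belong to the same taxon".
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    positives = sum(1 for t in y_true if t)
    negatives = len(y_true) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one positive and one negative pair")
    true_pos = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    true_neg = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)
    return 0.5 * (true_pos / positives + true_neg / negatives)


# Hypothetical toy data: four genome pairs, one same-taxon pair misclassified.
toy_truth = [True, True, False, False]
toy_prediction = [True, False, False, False]
toy_score = balanced_accuracy(toy_truth, toy_prediction)
```

On the toy data, the true-positive rate is 1/2 and the true-negative rate is 2/2, so `toy_score` is 0.75. Unlike plain accuracy, this average is not dominated by whichever class has more genome pairs.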
Copyright The Author(s) 2025. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics.
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
License https://creativecommons.org/licenses/by/4.0
ORCID 0000-0001-9216-5619
OpenAccessLink https://doi.org/10.1093/nargab/lqaf090
PMID 40585302
Subject Terms Aquatic Organisms - classification
Aquatic Organisms - genetics
Archaea - classification
Archaea - genetics
Bacteria - classification
Bacteria - genetics
Genome, Bacterial
Machine Learning
Metagenome
Phylogeny
URI https://www.ncbi.nlm.nih.gov/pubmed/40585302
https://www.proquest.com/docview/3225872194