Machine learning models for delineating marine microbial taxa
Published in | NAR genomics and bioinformatics Vol. 7; no. 2; p. lqaf090 |
---|---|
Main Author | Louca, Stilianos |
Format | Journal Article |
Language | English |
Published | England, 01.06.2025 |
Subjects | |
ISSN | 2631-9268 |
DOI | 10.1093/nargab/lqaf090 |
Abstract | The relationship between gene content differences and microbial taxonomic divergence remains poorly understood, and algorithms for delineating novel microbial taxa above genus level based on multiple genome similarity metrics are lacking. Addressing these gaps is important for macroevolutionary theory, biodiversity assessments, and discovery of novel taxa in metagenomes. Here, I develop machine learning classifier models, based on multiple genome similarity metrics, to determine whether any two marine bacterial and archaeal (prokaryotic) metagenome-assembled genomes (MAGs) belong to the same taxon, from the genus up to the phylum levels. Metrics include average amino acid and nucleotide identities, and fractions of shared genes within various categories, applied to 14 390 previously published non-redundant MAGs. At all taxonomic levels, the balanced accuracy (average of the true-positive and true-negative rate) of classifiers exceeded 92%, suggesting that simple genome similarity metrics serve as good taxon differentiators. Predictor selection and sensitivity analyses revealed gene categories, e.g. those involved in metabolism of cofactors and vitamins, particularly correlated to taxon divergence. Predicted taxon delineations were further used to de novo enumerate marine prokaryotic taxa. Statistical analyses of those enumerations suggest that over half of extant marine prokaryotic phyla, classes, and orders have already been recovered by genome-resolved metagenomic surveys. |
---|---|
Author | Louca, Stilianos |
Author_xml | – sequence: 1 givenname: Stilianos orcidid: 0000-0001-9216-5619 surname: Louca fullname: Louca, Stilianos |
BackLink | https://www.ncbi.nlm.nih.gov/pubmed/40585302 (MEDLINE/PubMed record) |
ContentType | Journal Article |
Copyright | The Author(s) 2025. Published by Oxford University Press on behalf of NAR Genomics and Bioinformatics. |
EISSN | 2631-9268 |
ExternalDocumentID | 40585302 10_1093_nargab_lqaf090 |
Genre | Journal Article |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 2 |
License | https://creativecommons.org/licenses/by/4.0 |
ORCID | 0000-0001-9216-5619 |
OpenAccessLink | https://doi.org/10.1093/nargab/lqaf090 |
PMID | 40585302 |
PublicationDate | 2025-06-01 |
PublicationTitle | NAR genomics and bioinformatics |
PublicationTitleAlternate | NAR Genom Bioinform |
PublicationYear | 2025 |
StartPage | lqaf090 |
SubjectTerms | Aquatic Organisms - classification; Aquatic Organisms - genetics; Archaea - classification; Archaea - genetics; Bacteria - classification; Bacteria - genetics; Genome, Bacterial; Machine Learning; Metagenome; Phylogeny |
Title | Machine learning models for delineating marine microbial taxa |
URI | https://www.ncbi.nlm.nih.gov/pubmed/40585302 https://www.proquest.com/docview/3225872194 |
Volume | 7 |
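
The abstract defines balanced accuracy as the average of the true-positive and true-negative rates. A minimal sketch of that definition follows, assuming binary labels where 1 means a pair of MAGs belongs to the same taxon and 0 means it does not. The function name and the toy labels are illustrative assumptions, not the article's code or data.

```python
def balanced_accuracy(y_true, y_pred):
    """Average of the true-positive rate and the true-negative rate."""
    pairs = list(zip(y_true, y_pred))
    true_pos = sum(1 for t, p in pairs if t == 1 and p == 1)
    true_neg = sum(1 for t, p in pairs if t == 0 and p == 0)
    n_pos = sum(1 for t in y_true if t == 1)
    n_neg = sum(1 for t in y_true if t == 0)
    if n_pos == 0 or n_neg == 0:
        raise ValueError("need at least one positive and one negative example")
    return 0.5 * (true_pos / n_pos + true_neg / n_neg)


# Hypothetical labels for eight genome pairs (made up for illustration):
# 1 = same taxon, 0 = different taxa.
y_true = [1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 1]

score = balanced_accuracy(y_true, y_pred)
plain = sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)
print(f"balanced accuracy = {score:.3f}, plain accuracy = {plain:.3f}")
```

Because most genome pairs are expected to fall in different taxa, plain accuracy can look high even when same-taxon pairs are often missed; balanced accuracy weights both classes equally, which is presumably why the abstract reports it.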