Ensemble method for cluster number determination and algorithm selection in unsupervised learning [version 1; peer review: 2 approved with reservations, 1 not approved]
Published in | F1000 research Vol. 11; p. 573 |
---|---|
Main Author | Zambelli, Antoine |
Format | Journal Article |
Language | English |
Published | 2022 |
Subjects | |
Abstract | Unsupervised learning, and more specifically clustering, suffers from the need for expertise in the field to be of use. Researchers must make careful and informed decisions on which algorithm to use with which set of hyperparameters for a given dataset. Additionally, researchers may need to determine the number of clusters in the dataset, which is unfortunately itself an input to most clustering algorithms; all of this before embarking on their actual subject matter work. After quantifying the impact of algorithm and hyperparameter selection, we propose an ensemble clustering framework which can be leveraged with minimal input. It can be used to determine both the number of clusters in the dataset and a suitable choice of algorithm to use for a given dataset. A code library is included in the Conclusions for ease of integration. |
---|---|
Author | Zambelli, Antoine |
Author_xml | Zambelli, Antoine; ORCID: 0000-0001-9635-0061; email: antoine.zambelli@gmail.com; organization: University of California, Berkeley, Berkeley, CA, 94720, USA |
ContentType | Journal Article |
Copyright | Copyright: © 2022 Zambelli A |
DOI | 10.12688/f1000research.121486.1 |
Discipline | Medicine Women's Studies |
EISSN | 2046-1402 |
ISSN | 2046-1402 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Keywords | K-means; Gaussian Mixture; Number of Clusters; Ensemble; Consensus Clustering; Hierarchical Clustering; Spectral Clustering; Unsupervised Learning; Clustering |
License | http://creativecommons.org/licenses/by/4.0/: This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. |
ORCID | 0000-0001-9635-0061 |
OpenAccessLink | http://journals.scholarsportal.info/openUrl.xqy?doi=10.12688/f1000research.121486.1 |
PublicationTitle | F1000 research |
PublicationYear | 2022 |
StartPage | 573 |
URI | http://dx.doi.org/10.12688/f1000research.121486.1 |
Volume | 11 |
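
The abstract describes an ensemble that yields both a cluster count and a suitable algorithm for a dataset, but this record does not say how the ensemble combines its members. As a rough illustration only, the sketch below assumes the simplest possible rule: each algorithm proposes a number of clusters, the most common proposal wins, and the algorithms that agree with it are reported as candidates. The function name `consensus_k`, the tie-breaking rule, and the input values are all hypothetical and are not taken from the article or its code library.

```python
from collections import Counter


def consensus_k(estimates):
    """Pick the most frequently proposed cluster count (illustrative rule).

    `estimates` maps a hypothetical algorithm name to the number of
    clusters that algorithm proposed for the dataset. Returns the modal
    count and the sorted names of the algorithms that proposed it.
    Ties are broken in favour of the smaller count (an assumption).
    """
    if not estimates:
        raise ValueError("need at least one estimate")
    votes = Counter(estimates.values())
    # Highest vote count wins; on a tie, prefer the smaller k.
    best_k = max(votes, key=lambda k: (votes[k], -k))
    agreeing = sorted(name for name, k in estimates.items() if k == best_k)
    return best_k, agreeing


# Made-up inputs for illustration; these are not results from the paper.
example = {"k-means": 3, "gaussian mixture": 3, "hierarchical": 4, "spectral": 3}
best_k, agreeing = consensus_k(example)
print(best_k, agreeing)
```

In practice the per-algorithm proposals would come from fitting each clustering method over a range of candidate counts; that step is left out here because the record gives no detail on it.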