Ensemble method for cluster number determination and algorithm selection in unsupervised learning [version 1; peer review: 2 approved with reservations, 1 not approved]

Bibliographic Details
Published in F1000 Research Vol. 11; p. 573
Main Author Zambelli, Antoine
Format Journal Article
Language English
Published 2022
Abstract Unsupervised learning, and more specifically clustering, suffers from the need for expertise in the field to be of use. Researchers must make careful and informed decisions on which algorithm to use with which set of hyperparameters for a given dataset. Additionally, researchers may need to determine the number of clusters in the dataset, which is unfortunately itself an input to most clustering algorithms; all of this before embarking on their actual subject matter work. After quantifying the impact of algorithm and hyperparameter selection, we propose an ensemble clustering framework which can be leveraged with minimal input. It can be used to determine both the number of clusters in the dataset and a suitable choice of algorithm to use for a given dataset. A code library is included in the Conclusions for ease of integration.
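The abstract describes an ensemble that yields both a cluster count and an algorithm choice. The record does not give the details, so the following is only a minimal sketch of one plausible voting scheme, not the paper's method. It assumes each candidate algorithm has already been run over a range of k and scored with some internal validity index where higher is better (for example, a silhouette score). The function name `ensemble_select`, the algorithm labels, and the score values are all hypothetical and chosen for illustration.

```python
from collections import Counter


def ensemble_select(scores):
    """Pick a cluster count and an algorithm from precomputed validity scores.

    scores: {algorithm_name: {k: validity_score}}, higher score is better.
    This is an illustrative majority-vote scheme, not the paper's procedure.
    """
    # Each algorithm votes for the k at which its own score peaks.
    votes = {name: max(by_k, key=by_k.get) for name, by_k in scores.items()}

    # The ensemble's k is the most common vote; ties go to the smaller k.
    tally = Counter(votes.values())
    top = max(tally.values())
    best_k = min(k for k, n in tally.items() if n == top)

    # Among algorithms evaluated at best_k, choose the highest-scoring one.
    candidates = {
        name: by_k[best_k] for name, by_k in scores.items() if best_k in by_k
    }
    best_algorithm = max(candidates, key=candidates.get)
    return best_k, best_algorithm


# Hypothetical scores for three algorithms over k = 2..4.
scores = {
    "kmeans": {2: 0.41, 3: 0.62, 4: 0.55},
    "gaussian_mixture": {2: 0.39, 3: 0.58, 4: 0.60},
    "hierarchical": {2: 0.45, 3: 0.66, 4: 0.50},
}
best_k, best_algorithm = ensemble_select(scores)
print(best_k, best_algorithm)
```

With these made-up scores, two of the three algorithms peak at k = 3, so the vote settles on 3 and then selects the algorithm that scores highest at that k. Breaking ties toward the smaller k is a design choice of this sketch, made to prefer the simpler model.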
Author_xml – sequence: 1
  givenname: Antoine
  orcidid: 0000-0001-9635-0061
  surname: Zambelli
  fullname: Zambelli, Antoine
  email: antoine.zambelli@gmail.com
  organization: University of California, Berkeley, Berkeley, CA, 94720, USA
ContentType Journal Article
Copyright Copyright: © 2022 Zambelli A
Copyright_xml – notice: Copyright: © 2022 Zambelli A
DOI 10.12688/f1000research.121486.1
Discipline Medicine
Women's Studies
EISSN 2046-1402
ISSN 2046-1402
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Keywords K-means
Gaussian Mixture
Number of Clusters
Ensemble
Consensus Clustering
Hierarchical Clustering
Spectral Clustering
Unsupervised Learning
Clustering
License http://creativecommons.org/licenses/by/4.0/: This is an open access article distributed under the terms of the Creative Commons Attribution Licence, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.
ORCID 0000-0001-9635-0061
OpenAccessLink http://journals.scholarsportal.info/openUrl.xqy?doi=10.12688/f1000research.121486.1