An efficient K-means clustering algorithm for tall data

Bibliographic Details
Published in	Data Mining and Knowledge Discovery Vol. 34; no. 3; pp. 776-811
Main Authors	Capó, Marco; Pérez, Aritz; Lozano, Jose A.
Format	Journal Article
Language	English
Published	New York: Springer US, 01.05.2020; Springer Nature B.V.
Subjects	Algorithms; Approximation; Artificial Intelligence; Chemistry and Earth Sciences; Cluster analysis; Clustering; Computer Science; Data analysis; Data Mining and Knowledge Discovery; Datasets; Information Storage and Retrieval; Initial conditions; Journal Track of ECML PKDD 2020; Machine learning; Massive data points; Mathematical analysis; Parallel processing; Physics; Statistics for Engineering; Vector quantization; K-means; K-means problem; Lloyd's algorithm; Unsupervised learning; Coresets

More Information
Summary:	The analysis of increasingly large datasets is a task of major importance in a wide variety of scientific fields. Therefore, the development of efficient and parallel algorithms to perform such an analysis is a crucial topic in unsupervised learning. Cluster analysis algorithms are a key element of exploratory data analysis and, among them, the K-means algorithm stands out as the most popular approach due to its ease of implementation, straightforward parallelizability and relatively low computational cost. Unfortunately, the K-means algorithm also has some drawbacks that have been extensively studied, such as its strong dependence on the initial conditions and the fact that it may not scale well on massive datasets. In this article, we propose a recursive and parallel approximation to the K-means algorithm that scales well with the number of instances of the problem, without affecting the quality of the approximation. To achieve this, instead of analyzing the entire dataset, we work on small weighted sets of representative points that are distributed in such a way that more importance is given to those regions where it is harder to determine the correct cluster assignment of the original instances. In addition to different theoretical properties, which explain the reasoning behind the algorithm, experimental results indicate that our method outperforms the state of the art in terms of the trade-off between the number of distance computations and the quality of the solution obtained.
ISSN:	1384-5810 1573-756X
DOI:	10.1007/s10618-020-00678-9
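
Illustration:	The Summary describes running K-means on a small weighted set of representative points instead of the full dataset. The sketch below shows only that general idea and is not the authors' algorithm: it assumes a fixed uniform grid (the article instead places representatives adaptively, favoring regions where cluster assignment is hard), takes each non-empty cell's center of mass as a representative weighted by the cell's point count, and then runs a weighted Lloyd iteration on those representatives. The names grid_representatives, weighted_lloyd and cell_size are hypothetical and chosen for this example.

```python
import random
from collections import defaultdict


def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def grid_representatives(points, cell_size):
    """Assumed partition: a uniform grid. Returns (center of mass, count) per non-empty cell."""
    cells = defaultdict(list)
    for p in points:
        cells[tuple(int(x // cell_size) for x in p)].append(p)
    reps = []
    for members in cells.values():
        n = len(members)
        center = tuple(sum(coord) / n for coord in zip(*members))
        reps.append((center, n))
    return reps


def weighted_lloyd(reps, init_centroids, max_iter=100):
    """Lloyd's iteration on weighted points: assign to nearest centroid, then take weighted means."""
    centroids = [tuple(float(x) for x in c) for c in init_centroids]
    dim = len(centroids[0])
    for _ in range(max_iter):
        sums = [[0.0] * dim for _ in centroids]
        totals = [0.0] * len(centroids)
        for point, weight in reps:
            j = min(range(len(centroids)), key=lambda i: sq_dist(point, centroids[i]))
            totals[j] += weight
            for d in range(dim):
                sums[j][d] += weight * point[d]
        # An empty cluster keeps its previous centroid.
        updated = [
            tuple(s / totals[j] for s in sums[j]) if totals[j] > 0 else centroids[j]
            for j in range(len(centroids))
        ]
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Toy data (an assumption for the demo): two Gaussian blobs of 500 points each.
rng = random.Random(0)
points = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(500)]
points += [(rng.gauss(10, 1), rng.gauss(10, 1)) for _ in range(500)]

reps = grid_representatives(points, cell_size=1.0)
centroids = weighted_lloyd(reps, init_centroids=[(1.0, 1.0), (9.0, 9.0)])
print(len(points), "points reduced to", len(reps), "representatives")
print(centroids)
```

Each Lloyd iteration here computes distances only from the representatives to the centroids, which is where the reduction in distance computations mentioned in the Summary would come from. A coarser cell_size gives fewer representatives at the cost of a rougher approximation.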