An effective clustering scheme for high-dimensional data

Bibliographic Details
Published in	Multimedia tools and applications Vol. 83; no. 15; pp. 45001 - 45045
Main Authors	He, Xuansen; He, Fan; Fan, Yueping; Jiang, Lingmin; Liu, Runzong; Maalla, Allam
Format	Journal Article
Language	English
Published	New York: Springer US (Springer Nature B.V), 01.05.2024
Subjects	Algorithms; Cluster analysis; Clustering; Computer Communication Networks; Computer Science; Data Structures and Information Theory; Datasets; Discriminant analysis; Empirical analysis; Multimedia Information Systems; Normal distribution; Special Purpose and Application-Based Systems; Linear discriminant analysis; Silhouette analysis; Clustering validity function; Initial center selection; Empirical rule; K-means algorithm
Summary:	While the classical K-means algorithm has been widely used in many fields, it still has some defects. Therefore, this paper proposes a scheme to improve the clustering quality of the K-means algorithm. Farthest initial center selection and the min–max rule are used to improve the random initialization of the K-means algorithm, which avoids empty clusters in the clustering results. For high-dimensional data sets, standardized feature scaling makes the data subject to a normal distribution, and supervised linear discriminant analysis (LDA) is used to reduce the data dimension and facilitate visualization. The empirical rule is used to estimate the range of the number of clusters. Within this range, the number of clusters is estimated visually by locating the elbow of the sum-of-squared-errors (SSE) curve. Further, a novel clustering validity function f(K) is proposed to determine the optimal number of clusters for complex real-world data sets. Through silhouette analysis, the clustering quality can be evaluated intuitively by calculating the silhouette coefficient of each cluster and observing its size. Simulation results on different types of data sets show that this scheme not only improves the clustering quality of the K-means algorithm but also provides a visual cluster analysis method for high-dimensional data sets.
ISSN:	1573-7721, 1380-7501
DOI:	10.1007/s11042-023-17129-4
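
Illustration (not taken from the article): the summary mentions farthest initial center selection with a min–max rule and an SSE curve whose elbow suggests the number of clusters. The sketch below shows one common reading of that idea, the farthest-first (max-min) traversal, followed by plain Lloyd iterations. The function names `farthest_first_centers` and `kmeans`, the choice of seeding with the point farthest from the data mean, and the toy data are all assumptions made for illustration; the paper's actual procedure, its validity function f(K), the LDA step, and the silhouette analysis are not reproduced here.

```python
import math


def farthest_first_centers(points, k):
    """Pick k initial centers with a max-min (farthest-first) rule.

    Assumption: the first center is the point farthest from the overall
    mean; the article may seed differently.
    """
    dim = len(points[0])
    mean = [sum(p[d] for p in points) / len(points) for d in range(dim)]
    centers = [max(points, key=lambda p: math.dist(p, mean))]
    while len(centers) < k:
        # Next center: the point whose nearest chosen center is farthest away.
        centers.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centers))
        )
    return centers


def kmeans(points, k, iters=100):
    """Lloyd's algorithm started from farthest-first centers.

    Returns (centers, labels, sse), where sse is the sum of squared
    distances from each point to its assigned center.
    """
    centers = [list(c) for c in farthest_first_centers(points, k)]
    labels = None
    for _ in range(iters):
        new_labels = [
            min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points
        ]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the previous center if a cluster has no members
                centers[j] = [sum(x) / len(members) for x in zip(*members)]
    sse = sum(math.dist(p, centers[lab]) ** 2 for p, lab in zip(points, labels))
    return centers, labels, sse


# Toy data (hypothetical): three small, well-separated groups in 2-D.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10),
          (20, 0), (20, 1), (21, 0)]

# SSE for a hypothetical candidate range of K; the elbow is read off this curve.
for k in range(1, 5):
    _, _, sse = kmeans(points, k)
    print(k, round(sse, 3))
```

Because each new center is chosen to be as far as possible from the centers already picked, the initial centers tend to land in different groups, which is the property the summary credits with avoiding empty clusters. The candidate range of K used in the loop is arbitrary here; the summary states that the article derives it from an empirical rule.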