An effective clustering scheme for high-dimensional data

Bibliographic Details
Published in: Multimedia tools and applications, Vol. 83, no. 15, pp. 45001-45045
Main Authors: He, Xuansen; He, Fan; Fan, Yueping; Jiang, Lingmin; Liu, Runzong; Maalla, Allam
Format: Journal Article
Language: English
Published: New York, Springer US, 01.05.2024
Springer Nature B.V.
Summary: Although the classical K-means algorithm is widely used in many fields, it still has some defects. This paper therefore proposes a scheme to improve the clustering quality of the K-means algorithm. Farthest initial center selection and the min–max rule replace the random initialization of K-means, which avoids empty clusters in the clustering results. For high-dimensional data sets, standardized feature scaling gives each feature zero mean and unit variance, and supervised linear discriminant analysis (LDA) is used to reduce the data dimension and facilitate visualization. The empirical rule is used to estimate the range of the number of clusters. Within this range, the number of clusters is estimated visually by locating the elbow of the sum-of-squared-errors (SSE) curve. Further, a novel clustering validity function f(K) is proposed to determine the optimal number of clusters for complex real-world data sets. Through silhouette analysis, the clustering quality can be evaluated intuitively by calculating the silhouette coefficient of each cluster and observing its size. Simulation results on different types of data sets show that the scheme not only improves the clustering quality of the K-means algorithm but also provides a visual cluster analysis method for high-dimensional data sets.
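The record gives no formulas for the min–max rule or for f(K), so the following is only a rough sketch of the general idea described in the summary, not the authors' method. It assumes a common farthest-first traversal as a stand-in for the farthest initial center selection, uses plain NumPy, and omits the LDA step because it needs class labels. All function names (`standardize`, `farthest_first_centers`, `kmeans_sse`) and the synthetic data are hypothetical.

```python
import numpy as np


def standardize(X):
    """Rescale every feature to zero mean and unit variance."""
    std = X.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled
    return (X - X.mean(axis=0)) / std


def farthest_first_centers(X, k):
    """Assumed stand-in for farthest initial center selection.

    Start from the point farthest from the data mean, then repeatedly
    add the point whose distance to its nearest chosen center is largest.
    """
    first = int(np.argmax(np.linalg.norm(X - X.mean(axis=0), axis=1)))
    chosen = [first]
    min_dist = np.linalg.norm(X - X[first], axis=1)
    for _ in range(1, k):
        nxt = int(np.argmax(min_dist))
        chosen.append(nxt)
        min_dist = np.minimum(min_dist, np.linalg.norm(X - X[nxt], axis=1))
    return X[chosen].copy()


def kmeans_sse(X, k, n_iter=100):
    """Lloyd's K-means with deterministic initial centers; returns labels, centers, SSE."""
    centers = farthest_first_centers(X, k)
    for _ in range(n_iter):
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        # Keep the old center if a cluster happens to receive no points.
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    labels = dist.argmin(axis=1)
    sse = float((dist[np.arange(len(X)), labels] ** 2).sum())
    return labels, centers, sse


# Synthetic example: three well-separated 2-D blobs of 50 points each.
rng = np.random.default_rng(0)
blobs = [rng.normal(loc, 0.5, size=(50, 2)) for loc in ((0, 0), (10, 0), (0, 10))]
X = standardize(np.vstack(blobs))

# SSE for a small, arbitrarily chosen range of K; the elbow is read off this curve.
sse_by_k = {k: kmeans_sse(X, k)[2] for k in range(1, 7)}
labels, centers, sse = kmeans_sse(X, 3)

for k, value in sse_by_k.items():
    print(f"K={k}  SSE={value:.2f}")
```

Because the initial centers are chosen deterministically, repeated runs on the same data give the same partition, which makes the SSE curve easier to compare across values of K than with random initialization.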
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-023-17129-4