Effective Clustering Analysis Based on New Designed Clustering Validity Index and Revised K-Means Algorithm for Big Data
Published in | 2018 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Ubiquitous Computing & Communications, Big Data & Cloud Computing, Social Computing & Networking, Sustainable Computing & Communications (ISPA/IUCC/BDCloud/SocialCom/SustainCom) pp. 96 - 102 |
---|---|
Main Authors | , , , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.12.2018 |
Subjects | |
Summary: | Clustering seeks the natural structure of an input dataset and partitions it into groups, or clusters. As an unsupervised pattern classification method, it has been widely used in data mining, pattern recognition, image processing, and related fields. However, many existing clustering algorithms suffer from low efficiency, poor clustering accuracy, sensitivity to noise points, and an inability to handle complex big data properly. To address these problems, an improved K-means algorithm (Grid-K-means) is first proposed. In this algorithm, operations on dynamically changing grids replace operations on individual data points, which improves clustering efficiency and reduces the number of initial parameters that must be set manually. In addition, the grids with the highest density are used to determine the initial cluster centers, which yields more accurate and stable clustering results. Then, based on the idea of using each grid as a weighted representative point when processing the dataset, a new clustering validity index (BCVI) is introduced to better evaluate the quality of clustering results. BCVI can quickly determine the optimal number of clusters, especially for large-scale datasets. Experiments on 5 simulated datasets (including two large-sample datasets) demonstrate that the Grid-K-means algorithm is faster and more accurate than the traditional ones. The clustering results are also evaluated with BCVI and 6 other existing clustering validity indexes; the results show that the new BCVI is superior to the traditional indexes in data processing speed and stability. |
---|---|
DOI: | 10.1109/BDCloud.2018.00027 |
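
The summary describes two ideas that a short sketch may make more concrete: seeding K-means from the densest grid cells, and treating each occupied cell as a weighted representative point. The Python below is only an illustration of those two ideas under stated assumptions. It is not the paper's Grid-K-means: the fixed uniform grid, the `cells_per_axis` parameter, and the function names `grid_density_seeds` and `weighted_kmeans` are hypothetical, and the dynamically changing grids and the BCVI index are not reproduced because the record does not define them.

```python
import numpy as np

def grid_density_seeds(points, k, cells_per_axis=8):
    """Bin points into a uniform grid and return the k densest cells as seeds.

    Assumption: a fixed uniform grid stands in for the paper's dynamic grids.
    Returns (seeds, reps, counts), where reps holds one mean point per
    occupied cell and counts holds the number of points in each cell.
    """
    points = np.asarray(points, dtype=float)
    lo, hi = points.min(axis=0), points.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)  # avoid division by zero on flat axes
    # Map every point to an integer cell index along each axis.
    idx = np.floor((points - lo) / span * cells_per_axis).astype(int)
    idx = np.clip(idx, 0, cells_per_axis - 1)
    cells, inverse, counts = np.unique(
        idx, axis=0, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)
    # Representative point of a cell = mean of the points that fall in it.
    sums = np.zeros((len(cells), points.shape[1]))
    np.add.at(sums, inverse, points)
    reps = sums / counts[:, None]
    # Densest cells first; adjacent dense cells are NOT merged in this sketch.
    order = np.argsort(-counts, kind="stable")
    return reps[order[:k]], reps, counts

def weighted_kmeans(reps, weights, centers, n_iter=20):
    """Lloyd iterations over cell representatives, weighted by cell counts."""
    centers = np.array(centers, dtype=float)
    for _ in range(n_iter):
        # Squared distance from every representative to every center.
        d = ((reps[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(len(centers)):
            mask = labels == j
            if mask.any():
                centers[j] = np.average(reps[mask], axis=0, weights=weights[mask])
    return centers, labels
```

Because the iterations run over occupied cells rather than raw points, the per-iteration cost depends on the number of cells, which is the intuition behind substituting grid operations for data point operations. Note that picking the top-k cells by count can place several seeds in one cluster when that cluster spans multiple dense cells; how the paper avoids this is not stated in the record.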