Ultra-Scalable Spectral Clustering and Ensemble Clustering

Bibliographic Details
Published in	IEEE Transactions on Knowledge and Data Engineering, Vol. 32, no. 6, pp. 1212-1226
Main Authors	Huang, Dong; Wang, Chang-Dong; Wu, Jian-Sheng; Lai, Jian-Huang; Kwoh, Chee-Keong
Format	Journal Article
Language	English
Published	New York: IEEE, 01.06.2020; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Algorithms; Approximation algorithms; Bipartite graph; Clustering; Clustering algorithms; Complexity theory; Data clustering; Datasets; ensemble clustering; Graph theory; large-scale clustering; large-scale datasets; nonlinearly separable datasets; Robustness; Scalability; Sparse matrices; Spectra; spectral clustering
Summary:	This paper addresses the scalability and robustness of spectral clustering for extremely large-scale datasets under limited computing resources. Two novel algorithms are proposed, namely, ultra-scalable spectral clustering (U-SPEC) and ultra-scalable ensemble clustering (U-SENC). In U-SPEC, a hybrid representative selection strategy and a fast approximation method for K-nearest representatives are proposed for the construction of a sparse affinity sub-matrix. By interpreting the sparse sub-matrix as a bipartite graph, the transfer cut is then used to efficiently partition the graph and obtain the clustering result. In U-SENC, multiple U-SPEC clusterers are further integrated into an ensemble clustering framework to enhance the robustness of U-SPEC while maintaining high efficiency. Based on the ensemble generated by multiple U-SPEC clusterers, a new bipartite graph is constructed between objects and base clusters and then efficiently partitioned to obtain the consensus clustering result. Both U-SPEC and U-SENC have nearly linear time and space complexity, and are capable of robustly and efficiently partitioning 10-million-level nonlinearly separable datasets on a PC with 64 GB of memory. Experiments on various large-scale datasets demonstrate the scalability and robustness of the algorithms. The MATLAB code and experimental data are available at https://www.researchgate.net/publication/330760669 .
ISSN:	1041-4347, 1558-2191
DOI:	10.1109/TKDE.2019.2903410
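
Illustrative sketch (not the authors' MATLAB code): the pipeline described in the Summary (select representatives, link each object to its K nearest representatives in a sparse affinity sub-matrix, then partition the resulting bipartite graph on the small representative side) might look roughly like the Python below. Several simplifications are assumptions made here and not taken from the record: plain random sampling stands in for the hybrid representative selection, exact nearest-representative search stands in for the fast approximation, a Gaussian kernel with a mean-distance bandwidth is assumed for the affinities, and the partition is limited to two clusters using power iteration and a sign split in the spirit of the transfer cut. The names `bipartite_spectral_bipartition` and `two_blobs` are hypothetical.

```python
import math
import random


def bipartite_spectral_bipartition(points, num_reps=20, k_nearest=5, seed=0, iters=300):
    """Two-way clustering via an object-representative bipartite graph (sketch)."""
    rng = random.Random(seed)
    n = len(points)
    # Assumption: random sampling replaces the paper's hybrid selection strategy.
    reps = rng.sample(points, num_reps)

    # Exact K-nearest representatives for every object (the paper approximates this).
    neighbors = []
    for x in points:
        dists = sorted((math.dist(x, r), j) for j, r in enumerate(reps))
        neighbors.append(dists[:k_nearest])
    # Assumed bandwidth: mean distance to the selected representatives.
    sigma = sum(d for row in neighbors for d, _ in row) / (n * k_nearest) or 1.0

    # Sparse affinity sub-matrix B (objects x representatives), one dict per row.
    B = [{j: math.exp(-d * d / (2 * sigma * sigma)) for d, j in row} for row in neighbors]
    d_obj = [max(sum(row.values()), 1e-12) for row in B]
    d_rep = [1e-12] * num_reps
    for row in B:
        for j, w in row.items():
            d_rep[j] += w
    inv_sqrt = [1.0 / math.sqrt(d) for d in d_rep]

    def apply_m(v):
        # Multiply by M = D_r^{-1/2} B^T D_x^{-1} B D_r^{-1/2} without forming M.
        y = [a * s for a, s in zip(v, inv_sqrt)]
        out = [0.0] * num_reps
        for row, dx in zip(B, d_obj):
            t = sum(w * y[j] for j, w in row.items()) / dx
            for j, w in row.items():
                out[j] += w * t
        return [a * s for a, s in zip(out, inv_sqrt)]

    # The trivial top eigenvector of M is proportional to sqrt(d_rep); deflate it.
    norm1 = math.sqrt(sum(d_rep))
    v1 = [math.sqrt(d) / norm1 for d in d_rep]
    v = [rng.uniform(-1.0, 1.0) for _ in range(num_reps)]
    for _ in range(iters):
        v = apply_m(v)
        proj = sum(a * b for a, b in zip(v, v1))
        v = [a - proj * b for a, b in zip(v, v1)]
        norm = math.sqrt(sum(a * a for a in v)) or 1.0
        v = [a / norm for a in v]

    # Carry the representative-side solution back to the objects and split by sign.
    y = [a * s for a, s in zip(v, inv_sqrt)]
    scores = [sum(w * y[j] for j, w in row.items()) / dx for row, dx in zip(B, d_obj)]
    return [1 if s > 0 else 0 for s in scores]


def two_blobs(n_per_blob=40, seed=1):
    """Toy data: two well-separated 2-D Gaussian blobs (first blob listed first)."""
    rng = random.Random(seed)
    a = [(rng.gauss(0, 0.5), rng.gauss(0, 0.5)) for _ in range(n_per_blob)]
    b = [(rng.gauss(8, 0.5), rng.gauss(8, 0.5)) for _ in range(n_per_blob)]
    return a + b


labels = bipartite_spectral_bipartition(two_blobs())
```

The eigen-computation touches only a vector of length `num_reps`, and each multiplication costs on the order of N times K operations, which is the kind of saving that makes a near-linear overall cost plausible. A multi-cluster version would need several eigenvectors followed by k-means on the object embedding in place of the sign split.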