Grid-based DBSCAN: Indexing and inference

Bibliographic Details
Published in	Pattern Recognition, Vol. 90, pp. 271-284
Main Authors	Boonchoo, Thapana; Ao, Xiang; Liu, Yang; Zhao, Weizhong; Zhuang, Fuzhen; He, Qing
Format	Journal Article
Language	English
Published	Elsevier Ltd 01.06.2019
Subjects	Density-based clustering; Grid-based DBSCAN; Union-find algorithm
Summary:	• The proposed method extends grid-based DBSCAN to scale to higher-dimensional datasets.
• A cluster forest is devised to alleviate redundancies in the merging step.
• HyperGrid Bitmap is used to index non-empty grids for efficient neighbor grid queries.
• Experiments show the superior performance of the proposed method on real and synthetic data.

DBSCAN is a clustering algorithm that reports arbitrarily shaped clusters and noise without requiring the number of clusters as a parameter (unlike other clustering algorithms such as k-means). Because the running time of DBSCAN grows quadratically, i.e. O(n^2), improving its performance has received considerable research attention for decades. Grid-based DBSCAN is a well-developed algorithm whose complexity is improved to O(n log n) in 2D space, while requiring Ω(n^(4/3)) time when the dimension is ≥ 3. However, we find that Grid-based DBSCAN suffers from two problems, neighbour explosion and redundancies in merging, which make the algorithm infeasible in high-dimensional space. In this paper we first propose a novel algorithm called GDCF, which utilizes bitmap indexing to support efficient neighbour grid queries. Second, based on the concept of the union-find algorithm, we devise a forest-like structure, called cluster forest, to alleviate the redundancies in merging. Moreover, we find that running the cluster forest in different orders can lead to a different number of merging operations in the merging step. We propose performing the merging step in a uniform random order to optimize the number of merging operations. However, for high-density databases a bottleneck can occur, so we further propose a low-density-first order to alleviate it. Experimental results on both real-world and synthetic datasets demonstrate that the proposed algorithm outperforms state-of-the-art exact and approximate DBSCAN algorithms and scales well.
ISSN:	0031-3203 1873-5142
DOI:	10.1016/j.patcog.2019.01.034
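
The summary describes merging grid cells with a union-find "cluster forest" processed in a uniform random order. The following is a minimal Python sketch of that general idea only, not the paper's GDCF implementation. The names `ClusterForest`, `neighbor_cells`, `merge_cells`, and `can_merge` are hypothetical, and the neighbourhood is assumed to be the immediately adjacent cells, which is a simplification. The sketch skips the (potentially expensive) merge check whenever two cells already share a root, which is where the redundancy is avoided.

```python
import random
from itertools import product


class ClusterForest:
    """Union-find over grid cells (path compression, union by size)."""

    def __init__(self, cells):
        self.parent = {c: c for c in cells}
        self.size = {c: 1 for c in cells}

    def find(self, c):
        # Locate the root, then point every node on the path at it.
        root = c
        while self.parent[root] != root:
            root = self.parent[root]
        while self.parent[c] != root:
            self.parent[c], c = root, self.parent[c]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False
        if self.size[ra] < self.size[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra
        self.size[ra] += self.size[rb]
        return True


def neighbor_cells(cell, occupied):
    """Yield occupied cells adjacent to `cell` (assumed neighbourhood)."""
    for offset in product((-1, 0, 1), repeat=len(cell)):
        if any(offset):
            nb = tuple(c + o for c, o in zip(cell, offset))
            if nb in occupied:
                yield nb


def merge_cells(cells, can_merge, seed=0):
    """Merge cells in a uniform random order; return forest and check count.

    `can_merge(a, b)` is a hypothetical stand-in for the density test that
    decides whether two neighbouring cells belong to the same cluster.
    """
    occupied = set(cells)
    forest = ClusterForest(occupied)
    order = sorted(occupied)
    random.Random(seed).shuffle(order)  # uniform random processing order
    checks = 0
    for cell in order:
        for nb in neighbor_cells(cell, occupied):
            if forest.find(cell) == forest.find(nb):
                continue  # already in the same tree: skip the redundant check
            checks += 1
            if can_merge(cell, nb):
                forest.union(cell, nb)
    return forest, checks
```

For example, `merge_cells([(0, 0), (0, 1), (1, 1), (5, 5)], lambda a, b: True)` places the first three cells in one tree and leaves `(5, 5)` on its own. A low-density-first order, as mentioned in the summary, could be approximated by sorting `order` by per-cell point count instead of shuffling; that variant is not shown here.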