Grid-based DBSCAN: Indexing and inference


Bibliographic Details
Published in: Pattern Recognition, Vol. 90, pp. 271-284
Main Authors: Boonchoo, Thapana; Ao, Xiang; Liu, Yang; Zhao, Weizhong; Zhuang, Fuzhen; He, Qing
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.06.2019
Summary:
• The proposed method extends grid-based DBSCAN to scale to higher-dimensional datasets.
• A cluster forest is devised to alleviate redundancies in the merging step.
• A HyperGrid Bitmap is used to index non-empty grids for efficient neighbour grid queries.
• Experiments show the performance superiority of the proposed method on real and synthetic data.

DBSCAN is a clustering algorithm that can report arbitrarily shaped clusters and noise without requiring the number of clusters as a parameter (unlike other clustering algorithms such as k-means). Because the running time of DBSCAN grows quadratically, i.e. O(n^2), research on improving its performance has received considerable attention for decades. Grid-based DBSCAN is a well-developed algorithm whose complexity is improved to O(n log n) in 2D space, while requiring Ω(n^(4/3)) time when the dimension is ≥ 3. However, we find that Grid-based DBSCAN suffers from two problems, neighbour explosion and redundancies in merging, which make the algorithm infeasible in high-dimensional space. In this paper we first propose a novel algorithm called GDCF, which utilizes bitmap indexing to support efficient neighbour grid queries. Second, based on the concept of the union-find algorithm, we devise a forest-like structure, called cluster forest, to alleviate the redundancies in merging. Moreover, we find that running the cluster forest in different orders can lead to a different number of merging operations in the merging step. We propose to perform the merging step in a uniform random order to optimize the number of merging operations. However, for high-density databases a bottleneck can occur, so we further propose a low-density-first order to alleviate it. The experimental results on both real-world and synthetic datasets demonstrate that the proposed algorithm outperforms state-of-the-art exact and approximate DBSCAN algorithms and suggest good scalability.
ISSN:0031-3203, 1873-5142
DOI:10.1016/j.patcog.2019.01.034
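
The summary states that the cluster forest builds on the union-find algorithm to avoid redundant work in the merging step. The sketch below illustrates only that general idea with a textbook union-find; it is not the authors' GDCF implementation. The names `UnionFind`, `merge_cells`, `candidate_pairs` and `can_merge` are hypothetical, and `can_merge` stands in for whatever expensive test decides that two neighbouring grid cells belong to the same cluster.

```python
class UnionFind:
    """Textbook disjoint-set forest with path compression and union by rank."""

    def __init__(self, items):
        self.parent = {x: x for x in items}
        self.rank = {x: 0 for x in items}

    def find(self, x):
        # Locate the root of x's tree.
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path directly at the root.
        while self.parent[x] != root:
            self.parent[x], x = root, self.parent[x]
        return root

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return False  # already in the same tree
        if self.rank[ra] < self.rank[rb]:
            ra, rb = rb, ra
        self.parent[rb] = ra  # attach the shallower tree under the deeper one
        if self.rank[ra] == self.rank[rb]:
            self.rank[ra] += 1
        return True


def merge_cells(cells, candidate_pairs, can_merge):
    """Merge grid cells pairwise, skipping pairs already in the same tree.

    Returns the forest and the number of times the (assumed expensive)
    can_merge test actually had to run.
    """
    forest = UnionFind(cells)
    checks = 0
    for a, b in candidate_pairs:
        if forest.find(a) == forest.find(b):
            continue  # redundant pair: both cells are already in one cluster
        checks += 1
        if can_merge(a, b):
            forest.union(a, b)
    return forest, checks


# Toy input (illustrative only): four cells, five candidate neighbour pairs,
# and a merge test that always succeeds.
cells = [0, 1, 2, 3]
candidate_pairs = [(0, 1), (1, 2), (0, 2), (2, 3), (1, 3)]
forest, checks = merge_cells(cells, candidate_pairs, lambda a, b: True)
roots = {forest.find(c) for c in cells}
```

In this toy run, two of the five candidate pairs are skipped because their cells already share a root. Since the skip depends on which pairs have been merged earlier, the order of `candidate_pairs` changes how many tests run, which is the effect the summary refers to when it discusses merging orders.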