Grid-based DBSCAN: Indexing and inference
Published in | Pattern recognition Vol. 90; pp. 271 - 284 |
---|---|
Main Authors | , , , , , |
Format | Journal Article |
Language | English |
Published | Elsevier Ltd, 01.06.2019 |
Summary: | • The proposed method extends grid-based DBSCAN to scale to higher-dimensional datasets. • A cluster forest is devised to alleviate redundancies in the merging step. • A HyperGrid Bitmap is used to index non-empty grids for efficient neighbor grid queries. • Experiments show the superior performance of the proposed method on real and synthetic data.
DBSCAN is a clustering algorithm that can report arbitrarily shaped clusters and noise without requiring the number of clusters as a parameter (unlike other clustering algorithms such as k-means). Because the running time of DBSCAN grows quadratically, i.e. O(n²), improving its performance has received considerable attention for decades. Grid-based DBSCAN is a well-developed algorithm whose complexity is improved to O(n log n) in 2D space, while requiring Ω(n^(4/3)) time when the dimension is ≥ 3. However, we find that Grid-based DBSCAN suffers from two problems, neighbour explosion and redundancies in merging, which make the algorithm infeasible in high-dimensional space. In this paper we first propose a novel algorithm called GDCF, which uses bitmap indexing to support efficient neighbour grid queries. Second, based on the concept of the union-find algorithm, we devise a forest-like structure, called cluster forest, to alleviate the redundancies in merging. Moreover, we find that running the cluster forest in different orders can lead to a different number of merge operations in the merging step. We propose performing the merging step in a uniform random order to optimize the number of merge operations. However, for high-density databases a bottleneck can occur, so we further propose a low-density-first order to alleviate it. Experimental results on both real-world and synthetic datasets demonstrate that the proposed algorithm outperforms state-of-the-art exact and approximate DBSCAN algorithms and suggest good scalability. |
---|---|
ISSN: | 0031-3203 1873-5142 |
DOI: | 10.1016/j.patcog.2019.01.034 |
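
To make the merging idea in the summary concrete, here is a minimal Python sketch of grid-based merging with a union-find forest. It is not the authors' GDCF implementation: the names `build_grid`, `UnionFind`, and `merge_core_cells` are hypothetical, neighbour cells are found by brute-force offset enumeration rather than a HyperGrid Bitmap, and only cells holding at least `min_pts` points are treated as core (a simplification of full DBSCAN). The `order` parameter imitates the uniform random and low-density-first orders mentioned in the abstract.

```python
import itertools
import math
import random
from collections import defaultdict


def build_grid(points, eps):
    """Assign each point to a cell of side eps / sqrt(d), so any two
    points sharing a cell are within eps of each other."""
    d = len(points[0])
    side = eps / math.sqrt(d)
    grid = defaultdict(list)
    for p in points:
        grid[tuple(math.floor(x / side) for x in p)].append(p)
    return grid


class UnionFind:
    """Forest of cells; each tree stands for one cluster."""

    def __init__(self, items):
        self.parent = {x: x for x in items}

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[ra] = rb


def merge_core_cells(grid, eps, min_pts, order="random", seed=0):
    """Merge dense cells into clusters (simplified, assumption-laden sketch).

    Returns the forest and the number of cell-pair distance checks made.
    """
    # Simplification: a cell is "core" only if it alone holds min_pts points.
    core = [c for c, pts in grid.items() if len(pts) >= min_pts]
    if order == "random":
        random.Random(seed).shuffle(core)
    elif order == "low_density_first":
        core.sort(key=lambda c: len(grid[c]))
    forest = UnionFind(core)
    core_set = set(core)
    d = len(core[0]) if core else 0
    reach = math.ceil(math.sqrt(d))  # cells farther apart cannot be within eps
    checks = 0
    for cell in core:
        # Brute-force neighbour enumeration: (2 * reach + 1) ** d offsets.
        for off in itertools.product(range(-reach, reach + 1), repeat=d):
            nb = tuple(a + b for a, b in zip(cell, off))
            if nb == cell or nb not in core_set:
                continue
            if forest.find(cell) == forest.find(nb):
                continue  # already in the same tree: skip the redundant check
            checks += 1
            if any(math.dist(p, q) <= eps for p in grid[cell] for q in grid[nb]):
                forest.union(cell, nb)
    return forest, checks


points = [(0.1, 0.1), (0.2, 0.2), (0.8, 0.1), (0.9, 0.2), (10.0, 10.0), (10.1, 10.1)]
grid = build_grid(points, eps=1.0)
forest, checks = merge_core_cells(grid, eps=1.0, min_pts=2, order="low_density_first")
clusters = {forest.find(c) for c in forest.parent}
print(len(clusters), "clusters after", checks, "distance checks")
```

The `find` comparison before each distance check is where a forest saves work: once two cells share a root, the pair is never compared again. The offset loop grows as (2·reach + 1)^d, which illustrates the neighbour explosion that the paper addresses with bitmap indexing.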