A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

•The underlying idea is: point p and point q should have similar neighbors, provided p and q are close to each other; given a certain eps, the closer they are, the more similar their neighbors are.•NQ-DBSCAN is an exact algorithm that may return the same result as DBSCAN if the parameters are same....

Full description

Saved in:
Bibliographic Details
Published inPattern recognition Vol. 83; pp. 375 - 387
Main Authors Chen, Yewang, Tang, Shengyu, Bouguila, Nizar, Wang, Cheng, Du, Jixiang, Li, HaiLin
Format Journal Article
LanguageEnglish
Published Elsevier Ltd 01.11.2018
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:•The underlying idea is: point p and point q should have similar neighbors, provided p and q are close to each other; given a certain eps, the closer they are, the more similar their neighbors are.•NQ-DBSCAN is an exact algorithm that may return the same result as DBSCAN if the parameters are same. While ρ-Approximate DBSCAN is an approximate algorithm.•The best complexity of NQ-DBSCAN can be O(n), and the average complexity of NQ-DBSCAN is proved to be O(n log(n)) provided the parameters are properly chosen. While ρ-Approximate DBSCAN runs only in O(n2) in high dimension.•NQ-DBSCAN is suitable for clustering data with a lot of noise. Clustering is an important technique to deal with large scale data which are explosively created in internet. Most data are high-dimensional with a lot of noise, which brings great challenges to retrieval, classification and understanding. No current existing approach is “optimal” for large scale data. For example, DBSCAN requires O(n2) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time which needs dimension D to be a relative small constant for the linear running time to hold. However, we prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n2) algorithm in very high dimension such that 2D >  > n. In this paper, we propose a novel local neighborhood searching technique, and apply it to improve DBSCAN, named as NQ-DBSCAN, such that a large number of unnecessary distance computations can be effectively reduced. Theoretical analysis and experimental results show that NQ-DBSCAN averagely runs in O(n*log(n)) with the help of indexing technique, and the best case is O(n) if proper parameters are used, which makes it suitable for many realtime data.
ISSN:0031-3203
1873-5142
DOI:10.1016/j.patcog.2018.05.030