A fast clustering algorithm based on pruning unnecessary distance computations in DBSCAN for high-dimensional data

Bibliographic Details
Published in	Pattern Recognition, Vol. 83, pp. 375-387
Main Authors	Chen, Yewang; Tang, Shengyu; Bouguila, Nizar; Wang, Cheng; Du, Jixiang; Li, HaiLin
Format	Journal Article
Language	English
Published	Elsevier Ltd 01.11.2018
Subjects	DBSCAN; NQ-DBSCAN; ρ-Approximate DBSCAN; 00-01; 99-00

Summary:	• The underlying idea is that two points p and q that are close to each other should have similar neighbors; for a given eps, the closer they are, the more similar their neighborhoods.
• NQ-DBSCAN is an exact algorithm that returns the same result as DBSCAN when the parameters are the same, whereas ρ-Approximate DBSCAN is an approximate algorithm.
• The best-case complexity of NQ-DBSCAN is O(n), and its average complexity is proved to be O(n log n) provided the parameters are properly chosen, whereas ρ-Approximate DBSCAN runs in O(n^2) in high dimensions.
• NQ-DBSCAN is suitable for clustering data with a lot of noise.

Clustering is an important technique for dealing with the large-scale data being created explosively on the internet. Most of these data are high-dimensional and noisy, which poses great challenges for retrieval, classification and understanding. No existing approach is “optimal” for large-scale data. For example, DBSCAN requires O(n^2) time, Fast-DBSCAN only works well in 2 dimensions, and ρ-Approximate DBSCAN runs in O(n) expected time only if the dimension D is a relatively small constant. We prove theoretically and experimentally that ρ-Approximate DBSCAN degenerates to an O(n^2) algorithm in very high dimensions, where 2^D ≫ n. In this paper, we propose a novel local neighborhood searching technique and apply it to DBSCAN; the resulting algorithm, NQ-DBSCAN, effectively eliminates a large number of unnecessary distance computations. Theoretical analysis and experimental results show that NQ-DBSCAN runs in O(n log n) on average with the help of an indexing technique, and in O(n) in the best case if proper parameters are used, which makes it suitable for many real-time data applications.
ISSN:	0031-3203 1873-5142
DOI:	10.1016/j.patcog.2018.05.030
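
The underlying idea in the summary (close points have similar neighbors) can be illustrated with the triangle inequality: if dist(p, q) <= eps, every eps-neighbor of q lies within 2*eps of p, so a search for q's neighbors can be restricted to p's 2*eps-neighborhood. The following is a minimal sketch of that pruning step only, not the authors' NQ-DBSCAN implementation; the function name `neighbors_within`, the random data, and the parameter values are assumptions made for illustration.

```python
import math
import random


def neighbors_within(points, center, radius, candidates):
    """Return indices in `candidates` whose distance to points[center] is <= radius.

    Hypothetical helper for illustration; it is a plain linear scan over candidates.
    """
    c = points[center]
    return {j for j in candidates if math.dist(c, points[j]) <= radius}


if __name__ == "__main__":
    random.seed(0)
    dim, n, eps = 8, 300, 0.6  # assumed values, chosen only for the demo
    points = [tuple(random.random() for _ in range(dim)) for _ in range(n)]
    everything = range(n)

    p = 0
    # One full scan around p yields both its eps- and 2*eps-neighborhoods.
    two_eps_ball = neighbors_within(points, p, 2 * eps, everything)
    eps_ball = neighbors_within(points, p, eps, two_eps_ball)

    for q in eps_ball:
        # Pruned search: only points within 2*eps of p are candidates for q.
        pruned = neighbors_within(points, q, eps, two_eps_ball)
        # Reference search: scan the whole data set.
        full = neighbors_within(points, q, eps, everything)
        assert pruned == full

    print("candidates per pruned query:", len(two_eps_ball), "of", n)
```

Each neighbor query for a point q near p then costs one distance computation per member of the 2*eps-neighborhood of p instead of one per point in the data set. How much this saves depends on eps and on the data distribution.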