CM-tree: A dynamic clustered index for similarity search in metric databases

Bibliographic Details
Published in	Data & knowledge engineering Vol. 63; no. 3; pp. 919 - 946
Main Authors	Aronovich, Lior; Spiegler, Israel
Format	Journal Article
Language	English
Published	Elsevier B.V. 01.12.2007
Subjects	Clustering methods; Database indexing; Metric access methods; Metric spaces; Similarity search
Summary:	Repositories of unstructured data types, such as free text, images, audio and video, have recently emerged in various fields. A general approach to searching such data is similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic, paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic insertions and deletions, both of single objects and in bulk. Unlike other methods, it is specifically designed to achieve a structure of tight, low-overlap clusters through its primary construction algorithms (rather than through post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting the representative objects of nodes, a clustering-based node split algorithm together with criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods, the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs.
ISSN:	0169-023X 1872-6933
DOI:	10.1016/j.datak.2007.06.001
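
Illustration:	The summary refers to similarity search under a metric distance function and to pruning during search. The sketch below shows only the generic triangle-inequality pruning that metric access methods rely on; it is not the CM-tree's own pruning method, node layout, or split algorithm, none of which are given in this record. The names `Cluster`, `range_search`, and `euclidean`, the flat list of clusters, and the sample points are all assumptions made for illustration.

```python
import math


def euclidean(a, b):
    """A metric distance function (assumed here; any metric would do)."""
    return math.dist(a, b)


class Cluster:
    """Hypothetical cluster: a representative object plus its members.

    The distance from every member to the representative is stored up
    front, so it can be reused at query time without recomputation.
    """

    def __init__(self, rep, members, dist=euclidean):
        self.rep = rep
        self.members = list(members)
        self.to_rep = [dist(rep, m) for m in self.members]
        # Covering radius: the farthest member from the representative.
        self.radius = max(self.to_rep, default=0.0)


def range_search(clusters, query, r, dist=euclidean):
    """Return (objects within distance r of query, distances computed)."""
    hits = []
    computed = 0
    for c in clusters:
        d_qr = dist(query, c.rep)
        computed += 1
        # Cluster-level pruning: by the triangle inequality, no member
        # can lie within r of the query if d(q, rep) > r + radius.
        if d_qr > r + c.radius:
            continue
        for m, d_mr in zip(c.members, c.to_rep):
            # Object-level pruning: |d(q, rep) - d(m, rep)| is a lower
            # bound on d(q, m), so the real distance can be skipped.
            if abs(d_qr - d_mr) > r:
                continue
            computed += 1
            if dist(query, m) <= r:
                hits.append(m)
    return hits, computed


clusters = [
    Cluster((0.0, 0.0), [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]),
    Cluster((10.0, 10.0), [(10.0, 10.0), (11.0, 10.0)]),
]
hits, computed = range_search(clusters, (0.5, 0.0), 0.6)
```

In this toy data the second cluster is discarded after a single distance computation to its representative. Tighter clusters with less overlap make that cluster-level test succeed more often, which is the property the summary says the CM-tree's construction aims for.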