CM-tree: A dynamic clustered index for similarity search in metric databases
Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance fu...
Saved in:
Published in | Data & knowledge engineering Vol. 63; no. 3; pp. 919 - 946 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published |
Elsevier B.V
01.12.2007
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Repositories of unstructured data types, such as free text, images, audio and video, have been recently emerging in various fields. A general searching approach for such data types is that of similarity search, where the search is for similar objects and similarity is modeled by a metric distance function. In this article we propose a new dynamic paged and balanced access method for similarity search in metric data sets, named CM-tree (Clustered Metric tree). It fully supports dynamic capabilities of insertions and deletions both of single objects and in bulk. Distinctive from other methods, it is especially designed to achieve a structure of tight and low overlapping clusters via its primary construction algorithms (instead of post-processing), yielding significantly improved performance. Several new methods are introduced to achieve this: a strategy for selecting representative objects of nodes, clustering based node split algorithm and criteria for triggering a node split, and an improved sub-tree pruning method used during search. To facilitate these methods the pairwise distances between the objects of a node are maintained within each node. Results from an extensive experimental study show that the CM-tree outperforms the M-tree and the Slim-tree, improving search performance by up to 312% for I/O costs and 303% for CPU costs. |
---|---|
ISSN: | 0169-023X 1872-6933 |
DOI: | 10.1016/j.datak.2007.06.001 |