GFilter: A General Gram Filter for String Similarity Search

Bibliographic Details
Published in	IEEE Transactions on Knowledge and Data Engineering, Vol. 27, no. 4, pp. 1005-1018
Main Authors	Hu, Haoji; Zheng, Kai; Wang, Xiaoling; Zhou, Aoying
Format	Journal Article
Language	English
Published	New York: IEEE, 01.04.2015; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Collection; Data integration; Educational institutions; Greedy algorithms; Heuristic; Heuristic methods; Indexes; Proteins; Query processing; Radiation detectors; Search problems; Searching; Similarity; Strings; gram-based framework; similarity search
Summary:	Numerous applications such as data integration, protein detection, and article copy detection share a similar core problem: given a query string, how to efficiently find all similar strings in a large-scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works use advanced filters to improve overall search performance. In this paper, we propose a gram-based framework that achieves near-maximum filter performance. The main idea is to judiciously choose high-quality grams as the prefix of the query according to their estimated ability to filter candidates. As this selection process is proven to be an NP-hard problem, we give a cost model to measure the filtering ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework over state-of-the-art approaches.
ISSN:	1041-4347; 1558-2191
DOI:	10.1109/TKDE.2014.2349914
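
Illustration:	The Summary describes choosing query grams by their estimated ability to filter candidates. The sketch below is not the paper's cost model or its heuristic algorithms; it only illustrates the general idea under stated assumptions. It assumes edit distance as the similarity measure, a q-gram inverted index over the collection, and inverted-list length as a stand-in for a gram's filtering cost. It relies on the standard observation that one edit operation destroys at most q grams, so any q*tau + 1 distinct query grams must share at least one gram with every string within distance tau. All function names (qgrams, build_index, edit_distance, search) are hypothetical.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return all overlapping substrings of length q, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Inverted index: gram -> set of ids of strings containing it."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in qgrams(s, q):
            index[g].add(sid)
    return index


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete from a
                           cur[j - 1] + 1,             # insert into a
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]


def search(query, strings, index, q, tau):
    """Return sorted ids of strings within edit distance tau of query."""
    grams = set(qgrams(query, q))
    need = q * tau + 1  # grams that tau edits cannot all destroy
    if len(grams) < need:
        # Too few distinct grams for the filter to be safe: scan everything.
        candidates = range(len(strings))
    else:
        # Assumed cost proxy: prefer grams with the shortest inverted lists.
        chosen = sorted(grams, key=lambda g: (len(index.get(g, ())), g))[:need]
        candidates = set().union(*(index.get(g, set()) for g in chosen))
    # Verification step: keep only true answers.
    return sorted(sid for sid in candidates
                  if edit_distance(query, strings[sid]) <= tau)


if __name__ == "__main__":
    data = ["similarity", "similarly", "singularity", "filter", "string"]
    idx = build_index(data, q=2)
    print(search("similarity", data, idx, q=2, tau=2))
```

Picking the rarest grams greedily keeps the candidate set small but is only a simple baseline; the Summary states that optimal gram selection is NP-hard, which is why the paper resorts to a cost model and heuristics.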