GFilter: A General Gram Filter for String Similarity Search

Bibliographic Details
Published in IEEE Transactions on Knowledge and Data Engineering, Vol. 27, no. 4, pp. 1005-1018
Main Authors Hu, Haoji; Zheng, Kai; Wang, Xiaoling; Zhou, Aoying
Format Journal Article
Language English
Published New York: IEEE, 01.04.2015
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)

Abstract Numerous applications, such as data integration, protein detection, and article copy detection, share a similar core problem: given a query string, how to efficiently find all similar strings in a large-scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works use advanced filters to improve overall search performance. In this paper, we propose a gram-based framework to achieve near-maximum filter performance. The main idea is to judiciously choose high-quality grams as the prefix of the query according to their estimated ability to filter candidates. As this selection process is proved to be an NP-hard problem, we give a cost model to measure the filter ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework in comparison with state-of-the-art approaches.
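
The prefix-filter idea mentioned in the abstract can be illustrated with a minimal sketch. This is not the GFilter algorithm or its cost model; it only assumes the standard observation that one edit operation destroys at most q of a string's q-grams, so any q*tau + 1 positional grams of the query must share at least one gram with every string within edit distance tau. The function names (`qgrams`, `build_index`, `search`) and the "pick the rarest grams" heuristic are hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict


def qgrams(s, q):
    """Return the positional q-grams of s as (position, gram) pairs."""
    return [(i, s[i:i + q]) for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Inverted index: gram -> set of ids of strings containing that gram."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for _, g in qgrams(s, q):
            index[g].add(sid)
    return index


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def search(query, strings, index, q, tau):
    """Return all strings within edit distance tau of query (sorted)."""
    grams = qgrams(query, q)
    need = q * tau + 1  # tau edits can destroy at most q * tau query grams
    if len(grams) < need:
        # Query too short for the filter to be safe: verify every string.
        candidates = range(len(strings))
    else:
        # Assumed heuristic: use the grams with the shortest inverted lists.
        chosen = sorted(grams, key=lambda pg: len(index.get(pg[1], ())))[:need]
        candidates = set()
        for _, g in chosen:
            candidates |= index.get(g, set())
    # Verification step: keep only true answers.
    return sorted(strings[i] for i in candidates
                  if edit_distance(query, strings[i]) <= tau)
```

Which grams are chosen does not affect correctness, only the number of candidates that must be verified; per the abstract, choosing them well is the hard part that the paper addresses with a cost model and heuristics.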
CODEN ITKEEH
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2015
DOI 10.1109/TKDE.2014.2349914
Discipline Engineering
Computer Science
EISSN 1558-2191
Funding NSFC (grants 61033007, 61170085, 61021004; funder ID 10.13039/501100001809)
Shanghai Knowledge Service Platform (grant ZF1213)
973 project (grant 2010CB328106)
Shanghai Leading Academic Discipline (grant B412)
ISSN 1041-4347
IsPeerReviewed true
IsScholarly true
Keywords gram-based framework
Data integration
similarity search
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
SubjectTerms Collection
Data integration
Educational institutions
Greedy algorithms
Heuristic
Heuristic methods
Indexes
Proteins
Query processing
Radiation detectors
Search problems
Searching
Similarity
Strings
URI https://ieeexplore.ieee.org/document/6880793
https://www.proquest.com/docview/1663304305
https://www.proquest.com/docview/1793280331