GFilter: A General Gram Filter for String Similarity Search
Published in | IEEE Transactions on Knowledge and Data Engineering, Vol. 27, no. 4, pp. 1005-1018 |
---|---|
Main Authors | Haoji Hu, Kai Zheng, Xiaoling Wang, Aoying Zhou |
Format | Journal Article |
Language | English |
Published | New York: IEEE, The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.04.2015 |
Abstract | Numerous applications, such as data integration, protein detection, and article copy detection, share a similar core problem: given a query string, how to efficiently find all similar answers in a large-scale string collection. Many existing methods adopt a prefix-filter-based framework to solve this problem, and a number of recent works use advanced filters to improve overall search performance. In this paper, we propose a gram-based framework to achieve near-maximum filter performance. The main idea is to judiciously choose high-quality grams as the prefix of the query according to their estimated ability to filter candidates. Because this selection process is proved to be an NP-hard problem, we give a cost model to measure the filter ability of grams and develop efficient heuristic algorithms to find high-quality grams. Extensive experiments on real datasets demonstrate the superiority of the proposed framework over state-of-the-art approaches. |
---|---|
Author | Haoji Hu; Kai Zheng; Xiaoling Wang; Aoying Zhou |
CODEN | ITKEEH |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2015 |
DOI | 10.1109/TKDE.2014.2349914 |
Discipline | Engineering; Computer Science |
EISSN | 1558-2191 |
EndPage | 1018 |
Genre | orig-research |
GrantInformation | NSFC (grants 61033007, 61170085, 61021004; funder ID 10.13039/501100001809); Shanghai Knowledge Service Platform (grant ZF1213); 973 project (grant 2010CB328106); Shanghai Leading Academic Discipline (grant B412) |
ISSN | 1041-4347 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 4 |
Keywords | gram-based framework; data integration; similarity search |
Language | English |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html |
PageCount | 14 |
PublicationDate | 2015-04-01 |
PublicationPlace | New York |
PublicationTitle | IEEE Transactions on Knowledge and Data Engineering |
PublicationTitleAbbrev | TKDE |
PublicationYear | 2015 |
Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
StartPage | 1005 |
SubjectTerms | Collection; Data integration; Educational institutions; Greedy algorithms; Heuristic; Heuristic methods; Indexes; Proteins; Query processing; Radiation detectors; Search problems; Searching; Similarity; Strings |
Title | GFilter: A General Gram Filter for String Similarity Search |
URI | https://ieeexplore.ieee.org/document/6880793 https://www.proquest.com/docview/1663304305 https://www.proquest.com/docview/1793280331 |
Volume | 27 |
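Illustrative note: the abstract refers to a prefix-filter-based framework in which a small set of query grams is used to retrieve candidates before verification. The following is a minimal sketch of that general idea only, not the GFilter algorithm or its cost model. It assumes edit distance as the similarity measure, fixed-length q-grams, and a simple "shortest inverted list first" heuristic for picking grams. All function names (`grams`, `build_index`, `edit_distance`, `search`) and the sample data are hypothetical. The sketch relies on the standard observation that one edit operation can destroy at most q of the query's gram occurrences, so any string within distance tau of the query must contain at least one of any q*tau + 1 chosen gram occurrences.

```python
from collections import defaultdict


def grams(s, q):
    """Return all overlapping q-grams (gram occurrences) of s, in order."""
    return [s[i:i + q] for i in range(len(s) - q + 1)]


def build_index(strings, q):
    """Inverted index: gram -> set of ids of strings containing that gram."""
    index = defaultdict(set)
    for sid, s in enumerate(strings):
        for g in grams(s, q):
            index[g].add(sid)
    return index


def edit_distance(a, b):
    """Classic dynamic-programming Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute or match
        prev = cur
    return prev[-1]


def search(query, strings, index, q, tau):
    """Return ids of strings within edit distance tau of query."""
    query_grams = grams(query, q)
    need = q * tau + 1  # number of gram occurrences that guarantees no misses
    if len(query_grams) < need:
        # Too few grams for the filter to be safe: verify every string.
        candidates = range(len(strings))
    else:
        # Assumed heuristic: prefer grams with the shortest inverted lists.
        chosen = sorted(query_grams, key=lambda g: len(index.get(g, ())))[:need]
        candidates = set()
        for g in chosen:
            candidates |= index.get(g, set())
    # Verification step: compute the exact distance for surviving candidates.
    return sorted(sid for sid in candidates
                  if edit_distance(query, strings[sid]) <= tau)


STRINGS = ["similarity", "similarly", "dissimilar", "gram filter", "protein"]
INDEX = build_index(STRINGS, q=2)
RESULT = search("similarity", STRINGS, INDEX, q=2, tau=2)
```

Choosing rare grams keeps the candidate set small while the q*tau + 1 bound keeps the filter lossless; per the abstract, the paper replaces this naive heuristic with a cost model and dedicated selection algorithms.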