Robust identification of fuzzy duplicates

Bibliographic Details
Published in	21st International Conference on Data Engineering (ICDE'05), pp. 865-876
Main Authors	Chaudhuri, S.; Ganti, V.; Motwani, R.
Format	Conference Proceeding
Language	English
Published	IEEE 2005
Subjects	Cleaning; Clustering algorithms; Costs; Couplings; Data mining; Partitioning algorithms; Robustness; Scalability; Training data

Summary:	Detecting and eliminating fuzzy duplicates is a critical data cleaning task that is required by many applications. Fuzzy duplicates are multiple seemingly distinct tuples, which represent the same real-world entity. We propose two novel criteria that enable characterization of fuzzy duplicates more accurately than is possible with existing techniques. Using these criteria, we propose a novel framework for the fuzzy duplicate elimination problem. We show that solutions within the new framework result in better accuracy than earlier approaches. We present an efficient algorithm for solving instantiations within the framework. We evaluate it on real datasets to demonstrate the accuracy and scalability of our algorithm.
ISBN:	0769522858, 9780769522852
ISSN:	1063-6382, 2375-026X
DOI:	10.1109/ICDE.2005.125