Record Completeness Evaluation Based on Multiple Data Sources

Bibliographic Details
Published in: 2019 IEEE International Conference on Power Data Science (ICPDS), pp. 109-112
Main Authors: Wu, Aman; Li, LingLi; Xuan, Ping
Format: Conference Proceeding
Language: English
Published: IEEE 01.11.2019
Summary: Completeness is one of the central criteria for data quality. Data completeness describes how fully the data capture the objective world they represent, and it is divided into the completeness of values and the completeness of tuples (records). This paper examines how to use multiple data sources to evaluate the record completeness of a target data set. An accurate record completeness evaluation would require access to all of the data sources, which is prohibitively costly and unrealistic. Therefore, this paper presents a signature-based randomized estimator for record completeness evaluation. The time to estimate record completeness is independent of the size of each data source. The basic idea of the randomized algorithm is to quickly estimate the record sets covered by the data sources and by the target data set, using signatures computed linearly for all data sources. Because the estimation works on these signatures, it avoids the large overhead of record-pair matching. Experimental results on real data demonstrate the effectiveness and efficiency of the algorithm.
DOI:10.1109/ICPDS47662.2019.9017199
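
The summary does not specify how the signatures are built, so the following is only an illustrative sketch of the general idea, not the authors' estimator. It swaps in a standard bottom-k (k-minimum-values) signature, and it assumes that record completeness means the fraction of records in the union of the data sources that also appear in the target data set, and that records match by exact string equality. All names (`record_hash`, `signature`, `merge`, `estimate_completeness`) are hypothetical.

```python
import hashlib
import heapq


def record_hash(record: str) -> int:
    """Map a record to a 64-bit integer (assumption: exact-match records)."""
    digest = hashlib.sha1(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(records, k):
    """Bottom-k signature: the k smallest distinct hash values of a data set.

    Built in one linear pass over the data set; done once per source.
    """
    return set(heapq.nsmallest(k, {record_hash(r) for r in records}))


def merge(sig_a, sig_b, k):
    """Signature of the union of two data sets, from their signatures alone."""
    return set(heapq.nsmallest(k, sig_a | sig_b))


def estimate_completeness(target_sig, source_sigs, k):
    """Estimate |target AND union(sources)| / |union(sources)|.

    Works only on signatures, so the cost depends on k and the number of
    sources, not on how many records each data set holds.
    """
    reference = set()
    for sig in source_sigs:
        reference = merge(reference, sig, k)

    # The k smallest hashes of (target OR reference) act as a uniform sample.
    combined = heapq.nsmallest(k, target_sig | reference)
    in_reference = [h for h in combined if h in reference]
    if not in_reference:
        return 0.0
    in_both = sum(1 for h in in_reference if h in target_sig)
    return in_both / len(in_reference)


if __name__ == "__main__":
    k = 64
    sources = [{"a", "b", "c"}, {"c", "d"}]
    target = {"a", "c", "e"}
    source_sigs = [signature(s, k) for s in sources]
    print(estimate_completeness(signature(target, k), source_sigs, k))
```

When every data set holds fewer than `k` distinct records, the signatures contain all hashes and the estimate is exact; for larger data sets, a larger `k` trades memory for a lower-variance estimate.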