Record Completeness Evaluation Based on Multiple Data Sources
Published in | 2019 IEEE International Conference on Power Data Science (ICPDS) pp. 109 - 112 |
---|---|
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.11.2019 |
Summary: | Completeness is one of the central criteria for data quality. Data completeness is the degree to which data fully describes the objective world, and it is divided into the completeness of values and the completeness of tuples. This paper examines how to use multiple data sources to evaluate the record completeness of target data. An accurate evaluation of record completeness would require accessing all of the data sources, which incurs huge costs and is unrealistic. The paper therefore presents a signature-based randomized estimator for record completeness evaluation. The basic idea of the randomized algorithm is to quickly estimate the record sets covered by the data sources and the target data set by linearly computing signatures for all data sources. The estimation time is independent of the size of each data set, which avoids the huge overhead of record pair matching. Experimental results on real data demonstrate the effectiveness and efficiency of the algorithm. |
---|---|
DOI: | 10.1109/ICPDS47662.2019.9017199 |
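
The summary does not spell out the signature scheme, so the following is only an illustrative sketch of how a signature-based estimate of record completeness might look. It assumes completeness is defined as the fraction of records in the union of the data sources that also appear in the target, and it swaps in a standard k-minimum-values (bottom-k) sketch as the signature. The function names `record_hash`, `signature`, and `estimate_completeness` and the parameter `k` are hypothetical and are not taken from the paper.

```python
import hashlib
import heapq


def record_hash(record: str) -> int:
    """Map a record to a 64-bit integer (assumed hash; not from the paper)."""
    digest = hashlib.sha1(record.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def signature(records, k):
    """Bottom-k signature: the k smallest distinct hash values of a data set.

    Built in one pass over the data set; this is the only size-dependent step.
    """
    return set(heapq.nsmallest(k, {record_hash(r) for r in records}))


def estimate_completeness(target_sig, source_sigs, k):
    """Estimate |target ∩ union(sources)| / |union(sources)| from signatures.

    Touches at most k hashes per signature, so the cost does not depend on
    the sizes of the underlying data sets.
    """
    # Bottom-k signature of the union of all sources.
    union_sig = set(heapq.nsmallest(k, set().union(*source_sigs)))
    # The k smallest hashes of (sources ∪ target) act as a uniform sample.
    sample = heapq.nsmallest(k, union_sig | target_sig)
    in_sources = [h for h in sample if h in union_sig]
    if not in_sources:
        return 0.0
    covered = sum(1 for h in in_sources if h in target_sig)
    return covered / len(in_sources)
```

In this sketch each signature is built once per data set, and the estimate itself only compares signatures, which mirrors the claim that estimation time is independent of data set size. When every data set has fewer than `k` distinct records, the signatures hold all hashes and the estimate is exact; otherwise it is a sample-based approximation whose accuracy improves as `k` grows.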