Interval Estimation for Aggregate Queries on Incomplete Data

Bibliographic Details
Published in	Journal of computer science and technology Vol. 34; no. 6; pp. 1203 - 1216
Main Authors	Zhang, An-Zhen, Li, Jian-Zhong, Gao, Hong
Format	Journal Article
Language	English
Published	New York; Springer US; 01.11.2019; Springer; Springer Nature B.V; School of Computer Science and Technology, Harbin Institute of Technology, Harbin 150001, China
Subjects	Analysis; Artificial Intelligence; Computer Science; Data Structures and Information Theory; End users; Information management; Information Systems Applications (incl. Internet); Lower bounds; Queries; Regular Paper; Semantics; Software Engineering; Theory of Computation; aggregate query; data quality; interval estimation; incomplete data
ISSN	1000-9000 1860-4749
DOI	10.1007/s11390-019-1970-4

Summary:	Incomplete data has been a longstanding issue in the database community, yet the subject is still poorly handled in both theory and practice. One common way to cope with missing values is to impute them (fill them in) as a preprocessing step before analysis. Unfortunately, no single imputation method can impute all missing values correctly in all cases, so users can hardly trust query results on such completed data without any confidence guarantee. In this paper, we propose to directly estimate the aggregate query result on incomplete data, rather than to impute the missing values. An interval estimation, composed of the upper and lower bounds of the aggregate query result over all possible interpretations of the missing values, is presented to end users. The ground-truth aggregate result is guaranteed to lie within the interval. We believe that decision support applications could benefit significantly from the estimation, since they can tolerate inexact answers as long as there are clearly defined semantics and guarantees associated with the results. Our main techniques are parameter-free and do not assume prior knowledge about the data distribution or the missingness mechanism. Experimental results are consistent with the theoretical results and suggest that the estimation is invaluable for better assessing the results of aggregate queries on incomplete data.
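The idea of bounding an aggregate over all possible interpretations of the missing values can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes, purely for illustration, that every missing value (written as None) is known to lie in a fixed domain [lo, hi] and that the query aggregates all rows with no selection predicate. The function names sum_interval and avg_interval are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def sum_interval(values: Sequence[Optional[float]],
                 lo: float, hi: float) -> Tuple[float, float]:
    """Bounds on SUM when each missing value (None) lies in [lo, hi].

    Assumption for this sketch: the domain bounds lo and hi are given.
    """
    if lo > hi:
        raise ValueError("lo must not exceed hi")
    known = sum(v for v in values if v is not None)
    missing = sum(1 for v in values if v is None)
    # Lowest sum: every missing value equals lo; highest: every one equals hi.
    return known + missing * lo, known + missing * hi


def avg_interval(values: Sequence[Optional[float]],
                 lo: float, hi: float) -> Tuple[float, float]:
    """Bounds on AVG over all rows; the row count is fixed, so divide by it."""
    if not values:
        raise ValueError("values must be non-empty")
    s_lo, s_hi = sum_interval(values, lo, hi)
    n = len(values)
    return s_lo / n, s_hi / n


column = [10, None, 30, None]          # two missing readings
print(sum_interval(column, 0, 100))    # (40, 240)
print(avg_interval(column, 0, 100))    # (10.0, 60.0)
```

Whatever the two missing readings really are, as long as they fall in [0, 100] the true SUM and AVG lie inside the printed intervals. Queries with selection predicates or grouping are harder, because a missing value can also change which rows qualify; this sketch does not cover that case.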