Efficiently processing deterministic approximate aggregation query on massive data
Published in | Knowledge and information systems Vol. 57; no. 2; pp. 437 - 473 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | London: Springer London, 01.11.2018; Springer Nature B.V |
Subjects | |
Summary: | In practical applications, aggregation is an important operation that returns statistical characterizations of a subset of the data set. On massive data, approximate aggregation is often preferable for its better timeliness and responsiveness. This paper focuses on deterministic approximate aggregation, which returns a running aggregate within a progressive deterministic error interval. The existing methods either return approximate results with fixed accuracy, return online approximate aggregates with a probabilistic confidence interval, or incur a high I/O cost on massive data. This paper proposes the LDA algorithm to compute deterministic approximate aggregates on massive data efficiently. LDA uses a hierarchically structured selection attribute lattice to distribute tuples and obtain a horizontal partitioning of the table. In each partition, each selection attribute is kept in a column file and each ranking attribute is transposed into bit-slices. Given the selection condition, only the relevant partitions are involved in computing the running aggregate. A compact storage scheme based on the Z-order space-filling curve is proposed to reduce the management cost of the partitions. An error reduction method is devised to narrow the error interval when computing the running aggregate. Extensive experimental results on synthetic and real data sets show that LDA has a significant performance advantage over the existing algorithms. |
---|---|
ISSN: | 0219-1377 0219-3116 |
DOI: | 10.1007/s10115-017-1136-z |
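
The summary's combination of bit-sliced ranking attributes and a running aggregate with a deterministic error interval can be illustrated with a minimal Python sketch. This is not the LDA algorithm from the paper. It only shows, under simplifying assumptions (non-negative integer values of a fixed bit width, a single SUM aggregate, and a selection already resolved to a boolean mask), how reading bit-slices from most to least significant yields bounds that are guaranteed to contain the exact answer. The names `to_bit_slices` and `running_sum` are hypothetical.

```python
from typing import Iterator, List, Tuple


def to_bit_slices(values: List[int], width: int) -> List[List[int]]:
    """Transpose non-negative integers into bit-slices, most significant first."""
    return [[(v >> bit) & 1 for v in values] for bit in range(width - 1, -1, -1)]


def running_sum(
    slices: List[List[int]], selected: List[bool]
) -> Iterator[Tuple[int, int]]:
    """Yield a (low, high) interval for SUM after each bit-slice is read.

    Assumption for illustration: `selected` marks the tuples that satisfy
    the selection condition. The exact sum always lies in [low, high].
    """
    width = len(slices)
    count = sum(selected)  # number of selected tuples
    low = 0
    for i, bits in enumerate(slices):
        weight = 1 << (width - 1 - i)  # place value of this slice
        low += weight * sum(b for b, keep in zip(bits, selected) if keep)
        # The unread lower bits contribute at most (weight - 1) per tuple.
        high = low + count * (weight - 1)
        yield low, high


values = [5, 3, 7, 2]                 # hypothetical ranking attribute values
mask = [True, False, True, True]      # hypothetical selection result
intervals = list(running_sum(to_bit_slices(values, 3), mask))
print(intervals)
```

In this sketch the interval shrinks with every slice read and collapses to the exact sum once the last slice is processed, so a caller can stop early as soon as the interval is narrow enough.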