AQUA+: Query Optimization for Hybrid Database-MapReduce System

Bibliographic Details
Published in Knowledge and Information Systems Vol. 63; no. 4; pp. 905-938
Main Authors Pang, Zhifei, Wu, Sai, Huang, Haichao, Hong, Zhouzhenyan, Xie, Yuqing
Format Journal Article
Language English
Published London Springer London 01.04.2021
Springer Nature B.V
Summary: MapReduce is widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. However, many existing applications maintain their data in a distributed database, and exporting that data into the storage system of MapReduce (normally a distributed file system) is costly. Moreover, compared to MapReduce, a database is equipped with many state-of-the-art techniques, such as indexes and a query optimizer. A hybrid Database-MapReduce system that inherits the advantages of both systems is therefore preferred. In this paper, we propose AQUA+, a query optimizer tailored for the hybrid system. AQUA+ extends our previous system, AQUA. It generates a plan that adaptively assigns operators to the database engine and the MapReduce engine to optimize performance. The intuition is to exploit indexes, co-partitioning, and other features provided by the database as much as possible and to reduce the data volume processed by MapReduce. Because query optimization is complex, AQUA+ introduces a novel tuning technique, learning to optimize. In particular, two neural networks are trained to predict cost and to refine the query plan, respectively. We train them on our log of real query processing. Experiments carried out on our in-house cluster confirm the effectiveness of our query optimizer.
ISSN: 0219-1377, 0219-3116
DOI: 10.1007/s10115-020-01542-4
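
The summary describes a plan that assigns each operator to either the database engine or the MapReduce engine so as to reduce the data handed to MapReduce. As a rough illustration only, and not the method from the paper, the sketch below picks an engine per operator for a simple linear plan by dynamic programming. The `Operator` class, the per-engine cost numbers, and the flat `transfer_cost` are all hypothetical; in AQUA+ the costs are predicted by a trained neural network rather than given as constants.

```python
from dataclasses import dataclass

ENGINES = ("db", "mr")  # hypothetical labels: database engine, MapReduce engine


@dataclass(frozen=True)
class Operator:
    name: str
    cost: dict  # assumed estimated cost per engine, e.g. {"db": 2.0, "mr": 8.0}


def assign_engines(plan, transfer_cost):
    """Choose an engine for each operator of a linear plan, minimizing total cost.

    transfer_cost (an assumption of this sketch) is charged whenever two
    consecutive operators run on different engines.
    """
    # best[e] = (cheapest total cost, engine choices) with the latest operator on e
    best = {e: (plan[0].cost[e], [e]) for e in ENGINES}
    for op in plan[1:]:
        nxt = {}
        for e in ENGINES:
            candidates = []
            for prev, (total, path) in best.items():
                move = 0.0 if prev == e else transfer_cost
                candidates.append((total + move + op.cost[e], path + [e]))
            nxt[e] = min(candidates, key=lambda t: t[0])
        best = nxt
    total, path = min(best.values(), key=lambda t: t[0])
    return total, list(zip((op.name for op in plan), path))


# Toy plan with made-up costs: the indexed scan and join are cheap in the
# database, while the large aggregation is cheaper in MapReduce.
plan = [
    Operator("scan", {"db": 1.0, "mr": 10.0}),
    Operator("join", {"db": 2.0, "mr": 8.0}),
    Operator("aggregate", {"db": 20.0, "mr": 3.0}),
]
total_cost, assignment = assign_engines(plan, transfer_cost=4.0)
print(total_cost, assignment)
```

With these made-up numbers the cheapest plan keeps the scan and join in the database and moves only the aggregation to MapReduce, which mirrors the intuition stated in the summary. Real query plans are trees rather than chains, so this is a simplification.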