AQUA+: Query Optimization for Hybrid Database-MapReduce System

Bibliographic Details
Published in	Knowledge and Information Systems, Vol. 63, no. 4, pp. 905-938
Main Authors	Pang, Zhifei; Wu, Sai; Huang, Haichao; Hong, Zhouzhenyan; Xie, Yuqing
Format	Journal Article
Language	English
Published	London: Springer London, 01.04.2021; Springer Nature B.V.
Subjects	Computer Science; Data analysis; Data Mining and Knowledge Discovery; Database Management; Hybrid systems; Information Storage and Retrieval; Information Systems and Communication Service; Information Systems Applications (incl. Internet); IT in Business; Neural networks; Optimization; Queries; Query processing; Regular Paper; Query Optimization; Data Partition; Learning to Tune; MapReduce
Summary:	MapReduce has been widely recognized as an efficient tool for large-scale data analysis. It achieves high performance by exploiting parallelism among processing nodes while providing a simple interface for upper-layer applications. However, many existing applications maintain their data in a distributed database, and exporting that data into the storage system of MapReduce (normally a distributed file system) is costly. Moreover, compared to MapReduce, a database is equipped with many state-of-the-art techniques, such as indexes and a query optimizer. Therefore, a hybrid Database-MapReduce system that inherits the advantages of both systems is preferred. In this paper, we propose AQUA+, a query optimizer tailored for such a hybrid system. AQUA+ extends our previous system, AQUA. It generates a plan that adaptively assigns operators to the database engine and the MapReduce engine to optimize performance. The intuition is to exploit indexes, co-partitioning, and other features provided by the database as much as possible, and to reduce the data volume processed by MapReduce. Because query optimization is complex, AQUA+ introduces a novel tuning technique, learning to optimize. In particular, two neural networks are trained, one to predict cost and one to refine the query plan. Both are trained on our log of real query processing. Experiments carried out on our in-house cluster confirm the effectiveness of our query optimizer.
ISSN:	0219-1377 0219-3116
DOI:	10.1007/s10115-020-01542-4
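
Illustration (not from the paper): the summary's idea of assigning operators to the database engine or the MapReduce engine can be sketched as a search over split points in a linear operator pipeline. Everything below is a hypothetical toy under stated assumptions: the `Operator` fields, the `assign_engines` function, the fixed per-operator costs, and the per-row transfer cost are invented for illustration. In particular, the fixed costs stand in for the learned cost predictor the summary mentions; AQUA+'s actual plan space and cost model are not described in this record.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Operator:
    """One step of a linear query pipeline (hypothetical toy model)."""
    name: str
    db_cost: float    # assumed cost if run in the database engine
    mr_cost: float    # assumed cost if run in the MapReduce engine
    output_rows: int  # rows the operator emits


def assign_engines(pipeline, input_rows, transfer_cost_per_row):
    """Try every split point k: operators before k run in the database,
    the rest run in MapReduce. Return the cheapest assignment and its cost."""
    best_total, best_k = None, 0
    for k in range(len(pipeline) + 1):
        db = sum(op.db_cost for op in pipeline[:k])
        mr = sum(op.mr_cost for op in pipeline[k:])
        # Rows leaving the database: raw input if nothing is pushed down,
        # otherwise the output of the last operator run in the database.
        shipped = pipeline[k - 1].output_rows if k > 0 else input_rows
        total = db + mr + shipped * transfer_cost_per_row
        if best_total is None or total < best_total:
            best_total, best_k = total, k
    plan = {
        op.name: "database" if i < best_k else "mapreduce"
        for i, op in enumerate(pipeline)
    }
    return plan, best_total


# Made-up numbers: selective, index-friendly steps are cheap in the database,
# while the final user-defined aggregate is cheaper in MapReduce.
pipeline = [
    Operator("index_filter", db_cost=2.0, mr_cost=20.0, output_rows=1000),
    Operator("co_partitioned_join", db_cost=5.0, mr_cost=15.0, output_rows=500),
    Operator("udf_aggregate", db_cost=50.0, mr_cost=8.0, output_rows=10),
]
plan, total = assign_engines(pipeline, input_rows=100_000, transfer_cost_per_row=0.001)
print(plan, total)
```

With these made-up costs, the filter and join stay in the database and only the aggregate runs in MapReduce, so far fewer rows cross the engine boundary than if the raw input were exported.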