Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan

Bibliographic Details
Published in	IEEE Transactions on Cloud Computing, Vol. 12, no. 1, pp. 84-99
Main Authors	Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Ullman, Jeffrey D.
Format	Journal Article
Language	English
Published	Piscataway: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.01.2024
Subjects	Big Data applications; Cloud computing; Cloud storage; Computer architecture; Costs; data lakes; Data storage; Engines; Heuristic methods; Mathematical analysis; Measurement; Queries; query optimization
ISSN	2168-7161 2372-0018
DOI	10.1109/TCC.2023.3339208

Summary:	Cloud data lakes have emerged as an inexpensive solution for storing very large amounts of data. The main idea is the separation of the compute and storage layers: cheap cloud storage holds the data, while compute engines run analytics on it in "on-demand" mode. However, to perform any computation in this architecture, the data must be moved from the storage layer to the compute layer over the network for each calculation. This hurts calculation performance and requires huge network bandwidth. In this paper, we study different approaches to improving query performance in a data lake architecture. We define an optimization problem that can provably speed up data lake queries. We prove that the problem is NP-hard and suggest heuristic approaches. We then demonstrate through experiments that our approach is feasible and efficient (up to a 30× improvement in query execution time on the TPC-H benchmark).