Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan

Cloud data lakes emerge as an inexpensive solution for storing very large amounts of data. The main idea is the separation of compute and storage layers. Thus, cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in "on-demand"...

Bibliographic Details
Published in: IEEE Transactions on Cloud Computing, Vol. 12, no. 1, pp. 84-99
Main Authors: Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Ullman, Jeffrey D.
Format: Journal Article
Language: English
Published: Piscataway, The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2024
ISSN: 2168-7161
EISSN: 2372-0018
DOI: 10.1109/TCC.2023.3339208

Abstract: Cloud data lakes emerge as an inexpensive solution for storing very large amounts of data. The main idea is the separation of compute and storage layers. Thus, cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in "on-demand" mode. However, to perform any computation on the data in this architecture, the data must be moved from the storage layer to the compute layer over the network for each calculation. Obviously, that hurts calculation performance and requires huge network bandwidth. In this paper, we study different approaches to improve query performance in a data lake architecture. We define an optimization problem that can provably speed up data lake queries. We prove that the problem is NP-hard and suggest heuristic approaches. Then, we demonstrate through experiments that our approach is feasible and efficient (up to a 30× improvement in query execution time, based on the TPC-H benchmark).
Author Details
  1. Weintraub, Grisha
     ORCID: 0000-0003-4823-4757
     Email: grisha.weintraub@gmail.com
     Organization: Computer Science Department, Ben-Gurion University of the Negev, Beer-Sheva, Israel
  2. Gudes, Ehud
     ORCID: 0000-0002-4805-0651
     Email: ehud@cs.bgu.ac.il
     Organization: Computer Science Department, Ben-Gurion University of the Negev, Beer-Sheva, Israel
  3. Dolev, Shlomi
     ORCID: 0000-0001-5418-6670
     Email: dolev@cs.bgu.ac.il
     Organization: Computer Science Department, Ben-Gurion University of the Negev, Beer-Sheva, Israel
  4. Ullman, Jeffrey D.
     ORCID: 0000-0002-1847-3426
     Email: ullman@gmail.com
     Organization: Stanford University, Stanford, CA, USA
CODEN ITCCF6
Copyright: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024
Discipline Computer Science
EISSN 2372-0018
Genre orig-research
Funding
  Data Science Research Center
  Council for Higher Education; Israeli Council for Higher Education (funder ID: 10.13039/501100005385)
  Israel Data Science Initiative
  Israel Science Foundation; Israeli Science Foundation (grant ID: 465/22; funder ID: 10.13039/501100003977)
  Rita Altura trust chair
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
Subject Terms: Big Data applications; Cloud computing; Cloud storage; Computer architecture; Costs; Data lakes; Data storage; Engines; Heuristic methods; Mathematical analysis; Measurement; Queries; Query optimization
URI https://ieeexplore.ieee.org/document/10342737
https://www.proquest.com/docview/2947259913