Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan
Published in | IEEE transactions on cloud computing Vol. 12; no. 1; pp. 84 - 99 |
---|---|
Main Authors | Weintraub, Grisha; Gudes, Ehud; Dolev, Shlomi; Ullman, Jeffrey D. |
Format | Journal Article |
Language | English |
Published | Piscataway: IEEE, 01.01.2024; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
ISSN | 2168-7161; EISSN 2372-0018 |
DOI | 10.1109/TCC.2023.3339208 |
Abstract | Cloud data lakes emerge as an inexpensive solution for storing very large amounts of data. The main idea is the separation of compute and storage layers. Thus, cheap cloud storage is used for storing the data, while compute engines are used for running analytics on this data in "on-demand" mode. However, to perform any computation on the data in this architecture, the data should be moved from the storage layer to the compute layer over the network for each calculation. Obviously, that hurts calculation performance and requires huge network bandwidth. In this paper, we study different approaches to improve query performance in a data lake architecture. We define an optimization problem that can provably speed up data lake queries. We prove that the problem is NP-hard and suggest heuristic approaches. Then, we demonstrate through the experiments that our approach is feasible and efficient (up to ×30 query execution time improvement based on the TPC-H benchmark). |
---|---|
Author | Ullman, Jeffrey D.; Dolev, Shlomi; Gudes, Ehud; Weintraub, Grisha |
Author_xml | 1. Weintraub, Grisha (ORCID: 0000-0003-4823-4757; email: grisha.weintraub@gmail.com; Computer Science Department, Ben-Gurion University of the Negev, Beer-Sheva, Israel); 2. Gudes, Ehud (ORCID: 0000-0002-4805-0651; email: ehud@cs.bgu.ac.il; Computer Science Department, Ben-Gurion University of the Negev, Beer-Sheva, Israel); 3. Dolev, Shlomi (ORCID: 0000-0001-5418-6670; email: dolev@cs.bgu.ac.il; Computer Science Department, Ben-Gurion University of the Negev, Beer-Sheva, Israel); 4. Ullman, Jeffrey D. (ORCID: 0000-0002-1847-3426; email: ullman@gmail.com; Stanford University, Stanford, CA, USA) |
CODEN | ITCCF6 |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2024 |
Discipline | Computer Science |
EISSN | 2372-0018 |
EndPage | 99 |
ExternalDocumentID | 10_1109_TCC_2023_3339208 10342737 |
Genre | orig-research |
GrantInformation_xml | Data Science Research Center; Council for Higher Education (Israeli Council for Higher Education), funder ID 10.13039/501100005385; Israel Data Science Initiative; Israel Science Foundation (Israeli Science Foundation), grant 465/22, funder ID 10.13039/501100003977; Rita Altura trust chair |
ISSN | 2168-7161 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 1 |
Language | English |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
ORCID | 0000-0001-5418-6670 0000-0002-1847-3426 0000-0003-4823-4757 0000-0002-4805-0651 |
PageCount | 16 |
PublicationCentury | 2000 |
PublicationDate | 2024-Jan.-March 2024-1-00 20240101 |
PublicationDateYYYYMMDD | 2024-01-01 |
PublicationDate_xml | – month: 01 year: 2024 text: 2024-Jan.-March |
PublicationDecade | 2020 |
PublicationPlace | Piscataway |
PublicationPlace_xml | – name: Piscataway |
PublicationTitle | IEEE transactions on cloud computing |
PublicationTitleAbbrev | TCC |
PublicationYear | 2024 |
Publisher | IEEE The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Publisher_xml | – name: IEEE – name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
StartPage | 84 |
SubjectTerms | Big Data applications; Cloud computing; Cloud storage; Computer architecture; Costs; data lakes; Data storage; Engines; Heuristic methods; Mathematical analysis; Measurement; Queries; query optimization |
Title | Optimizing Cloud Data Lake Queries With a Balanced Coverage Plan |
URI | https://ieeexplore.ieee.org/document/10342737 https://www.proquest.com/docview/2947259913 |
Volume | 12 |