Efficient Deep Learning Pipelines for Accurate Cost Estimations Over Large Scale Query Workload

Bibliographic Details
Main Authors Kang, Johan Kok Zhi; Gaurav; Tan, Sien Yi; Cheng, Feng; Sun, Shixuan; He, Bingsheng
Format Journal Article
Language English
Published 23.03.2021

Abstract The use of deep learning models for forecasting the resource consumption patterns of SQL queries has recently been a popular area of study. With many companies using cloud platforms to power their data lakes for large-scale analytic demands, these models form a critical part of the pipeline for managing cloud resource provisioning. While these models have demonstrated promising accuracy, training them over large-scale industry workloads is expensive. Space inefficiencies of encoding techniques over large numbers of queries, together with the excessive padding used to enforce shape consistency across diverse query plans, imply 1) longer model training time and 2) the need for expensive, scaled-up infrastructure to support batched training. In response, we developed Prestroid, a tree-convolution-based data science pipeline that accurately predicts resource consumption patterns of query traces at a much lower cost. We evaluated our pipeline over 19K Presto OLAP queries from Grab, on a data lake of more than 20PB of data. Experimental results indicate that our pipeline outperforms benchmarks on predictive accuracy, contributing to more precise resource prediction for large-scale workloads, while also reducing per-batch memory footprint by 13.5x and per-epoch training time by 3.45x. We demonstrate direct cost savings of up to 13.2x for large batched model training over Microsoft Azure VMs.
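Note: the padding cost mentioned in the abstract can be illustrated with a minimal sketch. It assumes, purely for illustration, that every query plan in a batch is padded to the node count of the largest plan in that batch. The function name `padding_overhead` and the toy node counts are hypothetical; they are not taken from the paper and do not reproduce its measurements or its encoding scheme.

```python
def padding_overhead(node_counts):
    """Return (padded_slots, used_slots, wasted_fraction) for one batch.

    Assumption (illustrative only): each plan is padded to the node
    count of the largest plan in the batch so that all tensors in the
    batch share one shape.
    """
    if not node_counts:
        raise ValueError("batch must contain at least one plan")
    max_nodes = max(node_counts)
    padded_slots = max_nodes * len(node_counts)  # slots actually allocated
    used_slots = sum(node_counts)                # slots holding real plan nodes
    wasted_fraction = 1 - used_slots / padded_slots
    return padded_slots, used_slots, wasted_fraction


# Toy batch: node counts of four hypothetical query plans.
toy_batch = [3, 5, 8, 40]
padded, used, wasted = padding_overhead(toy_batch)
print(f"{padded} slots allocated, {used} used, {wasted:.0%} padding")
```

In this toy batch a single large plan forces every other plan to be padded to its size, which is the kind of shape-consistency overhead the abstract attributes to diverse query plans.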
BackLink https://doi.org/10.48550/arXiv.2103.12465 (View paper in arXiv)
Copyright http://arxiv.org/licenses/nonexclusive-distrib/1.0
DOI 10.48550/arxiv.2103.12465
OpenAccessLink https://arxiv.org/abs/2103.12465
SecondaryResourceType preprint
SourceID arxiv
SourceType Open Access Repository
SubjectTerms Computer Science - Databases
Computer Science - Learning