An effective algorithm for parallelizing sort merge joins in the presence of data skew
Published in | Databases in Parallel and Distributed Systems: 2nd International Symposium pp. 103 - 115 |
---|---|
Main Authors | Wolf, Joel L.; Dias, Daniel M.; Yu, Philip S. |
Format | Conference Proceeding |
Language | English |
Published | New York, NY, USA: ACM, 01.07.1990 |
Series | ACM Conferences |
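Subjects | Information systems -- Data management systems -- Database design and models; Information systems -- Data management systems -- Query languages |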
ISBN | 9780818620522 0818620528 |
DOI | 10.1145/319057.319072 |
Abstract | Parallel processing of relational queries has received considerable attention of late. However, in the presence of data skew, the speedup from conventional parallel join algorithms can be very limited, due to load imbalances among the various processors. Even a single large skew element can cause a processor to become overloaded. In this paper, we propose a parallel sort merge join algorithm which uses a divide-and-conquer approach to address the data skew problem. The proposed algorithm adds an extra scheduling phase to the usual sort, transfer and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements, and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution for data skew, the algorithm is demonstrated to achieve very good load balancing for the join phase in a CPU-bound environment, and is shown to be very robust relative to the degree of data skew and the total number of processors. |
---|---|
Author | Yu, Philip S.; Wolf, Joel L.; Dias, Daniel M. |
Author_xml | – sequence: 1 givenname: Joel L. surname: Wolf fullname: Wolf, Joel L. organization: P.O. Box 704, Yorktown Heights, N.Y. 10598, IBM Research Division, T. J. Watson Research Center – sequence: 2 givenname: Daniel M. surname: Dias fullname: Dias, Daniel M. organization: P.O. Box 704, Yorktown Heights, N.Y. 10598, IBM Research Division, T. J. Watson Research Center – sequence: 3 givenname: Philip S. surname: Yu fullname: Yu, Philip S. organization: P.O. Box 704, Yorktown Heights, N.Y. 10598, IBM Research Division, T. J. Watson Research Center |
ContentType | Conference Proceeding |
Copyright | 1990 ACM |
Copyright_xml | – notice: 1990 ACM |
DBID | 7SC 8FD JQ2 L7M L~C L~D |
DOI | 10.1145/319057.319072 |
DatabaseName | Computer and Information Systems Abstracts; Technology Research Database; ProQuest Computer Science Collection; Advanced Technologies Database with Aerospace; Computer and Information Systems Abstracts – Academic; Computer and Information Systems Abstracts Professional |
DatabaseTitle | Computer and Information Systems Abstracts; Technology Research Database; Computer and Information Systems Abstracts – Academic; Advanced Technologies Database with Aerospace; ProQuest Computer Science Collection; Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Computer and Information Systems Abstracts |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
Editor | Agrawal, Rakesh Bell, David |
Editor_xml | – sequence: 1 givenname: Rakesh surname: Agrawal fullname: Agrawal, Rakesh organization: IBM Almaden Research Center, San Jose, CA – sequence: 2 givenname: David surname: Bell fullname: Bell, David organization: Univ. of Ulster, Jordanstown, Co. Antrim, Northern Ireland, UK |
EndPage | 115 |
Genre | Conference Paper |
GroupedDBID | 6IE 6IK AAJGR ACGHX ACM ADPZR ALMA_UNASSIGNED_HOLDINGS APO BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK GUFHI OCL RIB RIC RIE 7SC 8FD AAWTH JQ2 L7M LHSKQ L~C L~D |
ISBN | 9780818620522 0818620528 |
IngestDate | Fri Jul 11 07:57:14 EDT 2025; Wed Jan 31 06:47:55 EST 2024 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
License | Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org |
LinkModel | OpenURL |
MeetingName | DPDS90: International Symposium on Database for Parallel Distributed Systems |
Notes | SourceType-Conference Papers & Proceedings-1 ObjectType-Conference Paper-1 content type line 25 |
PQID | 31359111 |
PQPubID | 23500 |
PageCount | 13 |
ParticipantIDs | acm_books_10_1145_319057_319072 proquest_miscellaneous_31359111 acm_books_10_1145_319057_319072_brief |
PublicationCentury | 1900 |
PublicationDate | 19900701 |
PublicationDateYYYYMMDD | 1990-07-01 |
PublicationDate_xml | – month: 07 year: 1990 text: 19900701 day: 01 |
PublicationDecade | 1990 |
PublicationPlace | New York, NY, USA |
PublicationPlace_xml | – name: New York, NY, USA |
PublicationSeriesTitle | ACM Conferences |
PublicationTitle | Databases in Parallel and Distributed Systems: 2nd International Symposium |
PublicationYear | 1990 |
Publisher | ACM |
Publisher_xml | – name: ACM |
SSID | ssj0000558183 |
Score | 1.2343279 |
Snippet | Parallel processing of relational queries has received considerable attention of late. However, in the presence of data skew, the speedup from conventional... |
SourceID | proquest acm |
SourceType | Aggregation Database Publisher |
StartPage | 103 |
SubjectTerms | Information systems -- Data management systems -- Database design and models; Information systems -- Data management systems -- Query languages |
Title | An effective algorithm for parallelizing sort merge joins in the presence of data skew |
URI | https://www.proquest.com/docview/31359111 |
hasFullText | 1 |
inHoldings | 1 |
linkProvider | IEEE |
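
The abstract describes a scheduling phase that balances join-phase load by assigning each large skew element to several processors. The Python sketch below only illustrates that general idea; it is not the authors' algorithm. It assumes a hypothetical `costs` mapping from join value to estimated join cost, splits any value whose cost exceeds the ideal per-processor share into equal pieces, and then places pieces greedily, largest first, on the least-loaded processor. The function name `schedule`, the `split` flag, and the equal-pieces rule are all assumptions made for illustration.

```python
import heapq
import math


def schedule(costs, n_procs, split=True):
    """Return (loads, assignment) for a greedy largest-first placement.

    costs: dict mapping a join value to its estimated (positive) join cost.
    With split=True, a value costing more than total / n_procs is cut into
    equal pieces, each no larger than that ideal share.
    """
    target = sum(costs.values()) / n_procs  # ideal load per processor
    pieces = []
    for value, cost in costs.items():
        k = max(1, math.ceil(cost / target)) if split else 1
        pieces.extend((cost / k, value) for _ in range(k))
    # Largest pieces first, so small pieces can fill the remaining gaps.
    pieces.sort(key=lambda piece: piece[0], reverse=True)

    heap = [(0.0, p) for p in range(n_procs)]  # (current load, processor id)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(n_procs)}
    for cost, value in pieces:
        load, p = heapq.heappop(heap)  # least-loaded processor so far
        assignment[p].append(value)
        heapq.heappush(heap, (load + cost, p))

    loads = [0.0] * n_procs
    for load, p in heap:
        loads[p] = load
    return loads, assignment


if __name__ == "__main__":
    # One heavily skewed value ("a") and four light ones, on 4 processors.
    costs = {"a": 100, "b": 10, "c": 10, "d": 10, "e": 10}
    print(max(schedule(costs, 4, split=False)[0]))  # skew value stays whole
    print(max(schedule(costs, 4, split=True)[0]))   # skew value is divided
```

In this toy input, leaving the skew value whole pins one processor at a load of 100, while splitting it lets the heaviest processor finish at roughly 40 out of a total of 140. A real sort merge join would also have to account for replicating the matching tuples of the other relation to every processor that receives a piece, which this sketch ignores.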