An effective algorithm for parallelizing sort merge joins in the presence of data skew

Bibliographic Details
Published in Databases in Parallel and Distributed Systems: 2nd International Symposium, pp. 103-115
Main Authors Wolf, Joel L., Dias, Daniel M., Yu, Philip S.
Format Conference Proceeding
Language English
Published New York, NY, USA: ACM, 1 July 1990
Series ACM Conferences
ISBN 9780818620522, 0818620528
DOI 10.1145/319057.319072

Abstract Parallel processing of relational queries has received considerable attention of late. However, in the presence of data skew, the speedup from conventional parallel join algorithms can be very limited, due to load imbalances among the various processors. Even a single large skew element can cause a processor to become overloaded. In this paper, we propose a parallel sort merge join algorithm which uses a divide-and-conquer approach to address the data skew problem. The proposed algorithm adds an extra scheduling phase to the usual sort, transfer and join phases. During the scheduling phase, a parallelizable optimization algorithm, using the output of the sort phase, attempts to balance the load across the multiple processors in the subsequent join phase. The algorithm naturally identifies the largest skew elements, and assigns each of them to an optimal number of processors. Assuming a Zipf-like distribution for data skew, the algorithm is demonstrated to achieve very good load balancing for the join phase in a CPU-bound environment, and is shown to be very robust relative to the degree of data skew and the total number of processors.
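
The load-balancing idea in the abstract can be illustrated with a small sketch. This is not the paper's scheduling algorithm, which the record does not describe in detail. It is a simple greedy heuristic (largest piece first, assigned to the least loaded processor) combined with splitting any join key whose work exceeds the ideal per-processor share. The function names `zipf_counts`, `schedule`, and `imbalance`, the Zipf parameter `theta`, and the cost model (work for a key is the product of its tuple counts in the two relations) are all assumptions made for illustration.

```python
import heapq
import math


def zipf_counts(total, n_values, theta=1.0):
    """Expected tuple count per join key under a Zipf-like law (p_i ~ 1 / i**theta)."""
    weights = [1.0 / (rank ** theta) for rank in range(1, n_values + 1)]
    norm = sum(weights)
    return [total * w / norm for w in weights]


def schedule(costs, n_procs, split=True):
    """Assign per-key join costs to processors and return the per-processor loads.

    Assumed heuristic, not the paper's optimizer: a key whose cost exceeds the
    ideal share (total / n_procs) is cut into equal pieces no larger than that
    share, then pieces are placed largest first on the least loaded processor.
    """
    ideal = sum(costs) / n_procs
    pieces = []
    for key, cost in enumerate(costs):
        parts = max(1, math.ceil(cost / ideal)) if split else 1
        pieces.extend((cost / parts, key) for _ in range(parts))
    pieces.sort(reverse=True)

    heap = [(0.0, proc) for proc in range(n_procs)]  # (current load, processor id)
    loads = [0.0] * n_procs
    for cost, _key in pieces:
        load, proc = heapq.heappop(heap)
        loads[proc] = load + cost
        heapq.heappush(heap, (loads[proc], proc))
    return loads


def imbalance(loads):
    """Ratio of the heaviest processor load to the average load (1.0 is perfect)."""
    return max(loads) / (sum(loads) / len(loads))


# Two relations with the same Zipf-like skew on the join column.
r_counts = zipf_counts(100_000, 1_000)
s_counts = zipf_counts(100_000, 1_000)
# Assumed cost model: joining a key costs (tuples in R) * (tuples in S).
costs = [r * s for r, s in zip(r_counts, s_counts)]

unsplit_loads = schedule(costs, 16, split=False)
split_loads = schedule(costs, 16, split=True)
print("imbalance without splitting:", round(imbalance(unsplit_loads), 2))
print("imbalance with splitting:   ", round(imbalance(split_loads), 2))
```

Without splitting, the single most frequent key dominates one processor, which matches the abstract's remark that even one large skew element can overload a processor. Allowing a heavy key to span several processors is what lets the greedy pass approach an even load in this toy setting.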
Author Affiliations Wolf, Joel L.; Dias, Daniel M.; Yu, Philip S.: IBM Research Division, T. J. Watson Research Center, P.O. Box 704, Yorktown Heights, N.Y. 10598
Copyright 1990 ACM
Editors Agrawal, Rakesh (IBM Almaden Research Center, San Jose, CA); Bell, David (Univ. of Ulster, Jordanstown, Co. Antrim, Northern Ireland, UK)
License Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from Permissions@acm.org
MeetingName DPDS90: International Symposium on Database for Parallel Distributed Systems
Subject Terms Information systems -- Data management systems -- Database design and models; Information systems -- Data management systems -- Query languages
URI https://www.proquest.com/docview/31359111