Algorithmic strategies for optimizing the parallel reduction primitive in CUDA

Bibliographic Details
Published in: 2012 International Conference on High Performance Computing & Simulation (HPCS), pp. 511-519
Main Authors: Martin, P. J.; Ayuso, L. F.; Torres, R.; Gavilanes, A.
Format: Conference Proceeding
Language: English
Published: IEEE, 1 July 2012
Abstract Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. These primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. The study of novel algorithmic strategies to optimize their implementation on current devices is therefore of interest to the GPU community. In this paper we focus on optimizing the reduction primitive, which reduces a data sequence to a single value using a binary associative operator. Although both tree-based and sequential-based algorithms have already been implemented on GPUs, their performance has not yet been compared. Our first contribution is therefore an experimental study of state-of-the-art reduction algorithms on CUDA. Next, we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), further improving its throughput. Finally, we apply the same methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case it is not clear which algorithm performs best, since throughput depends heavily on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones.
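Illustrative note (not part of the catalog record or the paper): the abstract contrasts tree-based and sequential-based reduction and mentions a segmented variant. The CPU-side Python sketch below only illustrates what those terms mean for a binary associative operator. It is not the authors' CUDA implementation. The function names, the `workers` parameter, and the choice to describe segments by their lengths are all assumptions made for this example.

```python
from operator import add


def tree_reduce(values, op):
    """Tree-shaped reduction: each pass combines adjacent pairs,
    roughly halving the sequence until one value remains."""
    level = list(values)
    if not level:
        raise ValueError("cannot reduce an empty sequence")
    while len(level) > 1:
        nxt = [op(level[i], level[i + 1]) for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:          # odd length: carry the last element up
            nxt.append(level[-1])
        level = nxt
    return level[0]


def sequential_reduce(values, op, workers=4):
    """Sequential-style reduction: each hypothetical worker folds one
    contiguous chunk in order, then the partial results are combined."""
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    chunk = -(-len(data) // workers)  # ceiling division
    partials = []
    for start in range(0, len(data), chunk):
        acc = data[start]
        for v in data[start + 1:start + chunk]:
            acc = op(acc, v)
        partials.append(acc)
    result = partials[0]
    for p in partials[1:]:
        result = op(result, p)
    return result


def segmented_reduce(values, segment_lengths, op):
    """Reduce each independent, non-empty segment to its own value.
    Segments are given by their lengths (an assumption of this sketch)."""
    data = list(values)
    out, start = [], 0
    for n in segment_lengths:
        out.append(tree_reduce(data[start:start + n], op))
        start += n
    return out


data = list(range(1, 11))
print(tree_reduce(data, add))                   # 55
print(sequential_reduce(data, add))             # 55
print(segmented_reduce(data, [3, 2, 5], add))   # [6, 9, 40]
```

Both orderings preserve the left-to-right position of the operands, so they agree for any associative operator, including non-commutative ones such as string concatenation. Which ordering is faster on a GPU is the empirical question the paper addresses, and this sketch says nothing about it.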
Author details
1. Martin, P. J. (pjmartin@sip.ucm.es), Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
2. Ayuso, L. F. (lf.ayuso@fdi.ucm.es), Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
3. Torres, R. (r.torres@fdi.ucm.es), Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
4. Gavilanes, A. (agav@sip.ucm.es), Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain
DOI 10.1109/HPCSim.2012.6266966
EISBN 1467323616
9781467323611
1467323624
9781467323628
ISBN 9781467323598
1467323594
OpenAccessLink http://www.csd.uwo.ca/~moreno/CS433-CS9624/Resources/Algorithmic_strategies_for_optimizing_the_parallel_reduction_primitive_in_CUDA.pdf
PageCount 9
SubjectTerms Arrays
CUDA
data-parallel algorithms
GPGPU
Graphics processing unit
Instruction sets
Kernel
Optimization
parallel reduction
segmented parallel reduction
Synchronization
Throughput
URI https://ieeexplore.ieee.org/document/6266966