Algorithmic strategies for optimizing the parallel reduction primitive in CUDA

Bibliographic Details
Published in	2012 International Conference on High Performance Computing & Simulation (HPCS) pp. 511 - 519
Main Authors	Martin, P. J., Ayuso, L. F., Torres, R., Gavilanes, A.
Format	Conference Proceeding
Language	English
Published	IEEE 01.07.2012
Subjects	Arrays CUDA data-parallel algorithms GPGPU Graphics processing unit Instruction sets Kernel Optimization parallel reduction segmented parallel reduction Synchronization Throughput

Abstract	Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. These primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. The study of novel algorithmic strategies to optimize their implementation on current devices is therefore of interest to the GPU community. In this paper we focus on optimizing the reduction primitive, which reduces a data sequence to a single value using a binary associative operator. Although tree-based and sequential-based algorithms have already been implemented on GPUs, their performance had not yet been compared. Our first contribution is therefore an experimental study of state-of-the-art reduction algorithms on CUDA. Next, we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), further improving its throughput. Finally, we apply the same methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case it is not clear in advance which algorithm performs best, since throughput depends strongly on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones.
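To make the terms in the abstract concrete, the following is a minimal host-side C++ sketch of the three ideas it names: a tree-based reduction, a sequential-based reduction, and a segmented reduction. It is not the authors' CUDA code and does not include their two optimizations; the function names (`tree_reduce`, `sequential_reduce`, `segmented_reduce`), the `workers` parameter, and the offset-array encoding of segments are assumptions made purely for illustration. Loops that a GPU would spread across threads are run serially here.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Tree-based reduction (illustrative): pairs are combined at a doubling
// stride, so a parallel machine would need about log2(n) steps.
template <typename T, typename Op>
T tree_reduce(std::vector<T> data, Op op, T identity) {
    if (data.empty()) return identity;
    for (std::size_t stride = 1; stride < data.size(); stride *= 2) {
        // Each iteration of this inner loop is independent and would map
        // to one thread on a GPU.
        for (std::size_t i = 0; i + stride < data.size(); i += 2 * stride) {
            data[i] = op(data[i], data[i + stride]);
        }
    }
    return data[0];
}

// Sequential-based reduction (illustrative): each of `workers` hypothetical
// threads folds one contiguous chunk serially, then the partial results
// are combined in order.
template <typename T, typename Op>
T sequential_reduce(const std::vector<T>& data, Op op, T identity,
                    std::size_t workers) {
    if (workers == 0) workers = 1;
    std::vector<T> partial(workers, identity);
    const std::size_t chunk = (data.size() + workers - 1) / workers;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t begin = w * chunk;
        const std::size_t end = std::min(begin + chunk, data.size());
        for (std::size_t i = begin; i < end; ++i) {
            partial[w] = op(partial[w], data[i]);
        }
    }
    T result = identity;
    for (const T& p : partial) result = op(result, p);
    return result;
}

// Segmented reduction (illustrative): segment s covers the index range
// [offsets[s], offsets[s + 1]) and yields one output value. The offset
// array is an assumed encoding, not taken from the paper.
template <typename T, typename Op>
std::vector<T> segmented_reduce(const std::vector<T>& data,
                                const std::vector<std::size_t>& offsets,
                                Op op, T identity) {
    std::vector<T> out;
    for (std::size_t s = 0; s + 1 < offsets.size(); ++s) {
        T acc = identity;
        for (std::size_t i = offsets[s]; i < offsets[s + 1]; ++i) {
            acc = op(acc, data[i]);
        }
        out.push_back(acc);
    }
    return out;
}
```

Both whole-sequence variants preserve operand order, so they agree for any associative operator with the given identity. They differ only in how the work is shaped, which is the property the abstract's throughput comparison is about.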
Author affiliation	Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain (all four authors)
Author emails	Martin, P. J.: pjmartin@sip.ucm.es; Ayuso, L. F.: lf.ayuso@fdi.ucm.es; Torres, R.: r.torres@fdi.ucm.es; Gavilanes, A.: agav@sip.ucm.es
DOI	10.1109/HPCSim.2012.6266966
EISBN	1467323616 9781467323611 1467323624 9781467323628
ISBN	9781467323598 1467323594
OpenAccessLink	http://www.csd.uwo.ca/~moreno/CS433-CS9624/Resources/Algorithmic_strategies_for_optimizing_the_parallel_reduction_primitive_in_CUDA.pdf
URI	https://ieeexplore.ieee.org/document/6266966