Algorithmic strategies for optimizing the parallel reduction primitive in CUDA
Published in | 2012 International Conference on High Performance Computing & Simulation (HPCS) pp. 511 - 519 |
---|---|
Main Authors | Martin, P. J.; Ayuso, L. F.; Torres, R.; Gavilanes, A. |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.07.2012 |
Abstract | Many general-purpose applications exploit Graphics Processing Units (GPUs) by executing a set of well-known data-parallel primitives. Those primitives are usually invoked from the host many times, so their throughput has a great impact on the performance of the overall system. Thus, the study of novel algorithmic strategies to optimize their implementation on current devices is of interest to the GPU community. In this paper we focus on optimizing the reduction primitive, which reduces a data sequence to a single value using a binary associative operator. Although tree-based and sequential-based algorithms have already been implemented on GPUs, their performance had not yet been compared. Thus, our first contribution is an experimental study of state-of-the-art reduction algorithms on CUDA. Next we introduce two algorithmic optimizations that are integrated into the fastest solution (a sequential-based algorithm), further improving its throughput. Finally, we apply the same methodology to the segmented version of the primitive, which applies when the input is composed of several independent segments. In this case, it is not clear which algorithm exhibits the best performance, since throughput depends heavily on the distribution of segments along the input. According to our results, tree-based algorithms run faster for small segments, while sequential methods are better for medium and large ones. |
---|---|
Author | Gavilanes, A.; Martin, P. J.; Torres, R.; Ayuso, L. F. |
Author_xml | 1. Martin, P. J. (pjmartin@sip.ucm.es); 2. Ayuso, L. F. (lf.ayuso@fdi.ucm.es); 3. Torres, R. (r.torres@fdi.ucm.es); 4. Gavilanes, A. (agav@sip.ucm.es). All authors: Dept. de Sist. Informaticos y Comput., Univ. Complutense de Madrid, Madrid, Spain |
ContentType | Conference Proceeding |
DOI | 10.1109/HPCSim.2012.6266966 |
EISBN | 1467323616 9781467323611 1467323624 9781467323628 |
EndPage | 519 |
ExternalDocumentID | 6266966 |
Genre | orig-research |
ISBN | 9781467323598 1467323594 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
OpenAccessLink | http://www.csd.uwo.ca/~moreno/CS433-CS9624/Resources/Algorithmic_strategies_for_optimizing_the_parallel_reduction_primitive_in_CUDA.pdf |
PageCount | 9 |
ParticipantIDs | ieee_primary_6266966 |
PublicationCentury | 2000 |
PublicationDate | 2012-July |
PublicationDateYYYYMMDD | 2012-07-01 |
PublicationDate_xml | – month: 07 year: 2012 text: 2012-July |
PublicationDecade | 2010 |
PublicationTitle | 2012 International Conference on High Performance Computing & Simulation (HPCS) |
PublicationTitleAbbrev | HPCSim |
PublicationYear | 2012 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
StartPage | 511 |
SubjectTerms | Arrays; CUDA; data-parallel algorithms; GPGPU; Graphics processing unit; Instruction sets; Kernel; Optimization; parallel reduction; segmented parallel reduction; Synchronization; Throughput |
Title | Algorithmic strategies for optimizing the parallel reduction primitive in CUDA |
URI | https://ieeexplore.ieee.org/document/6266966 |
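
The abstract contrasts tree-based and sequential-based reduction and mentions a segmented variant. As a rough illustration only, the following CPU-side Python sketch models the order in which each strategy combines elements. It is not the authors' CUDA code: the function names, the `num_threads` parameter, and the use of contiguous chunks (chosen so that an operator that is associative but not commutative still yields the right result) are all assumptions made for this example, and nothing here reflects GPU performance.

```python
from operator import add


def tree_reduce(values, op=add):
    """Tree-based order: combine adjacent pairs level by level."""
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    while len(data) > 1:
        # One level of the tree: element 2i is combined with element 2i + 1.
        nxt = [op(data[i], data[i + 1]) for i in range(0, len(data) - 1, 2)]
        if len(data) % 2:
            nxt.append(data[-1])  # an unpaired last element moves up unchanged
        data = nxt
    return data[0]


def sequential_reduce(values, num_threads=4, op=add):
    """Sequential-based order: each simulated thread folds one contiguous
    chunk in a loop; the few partial results are then combined."""
    data = list(values)
    if not data:
        raise ValueError("cannot reduce an empty sequence")
    chunk = -(-len(data) // num_threads)  # ceiling division
    partials = []
    for start in range(0, len(data), chunk):
        acc = data[start]
        for x in data[start + 1:start + chunk]:
            acc = op(acc, x)
        partials.append(acc)
    return tree_reduce(partials, op)


def segmented_reduce(values, segment_lengths, reduce_fn=tree_reduce, op=add):
    """Segmented variant: reduce each independent, non-empty segment."""
    data = list(values)
    results, start = [], 0
    for length in segment_lengths:
        results.append(reduce_fn(data[start:start + length], op=op))
        start += length
    return results
```

For instance, `segmented_reduce(data, lengths, reduce_fn=sequential_reduce)` swaps the per-segment strategy, which mirrors the abstract's point that the better choice depends on how segments are distributed along the input.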