Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures
Published in | 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 653-662 |
---|---|
Main Authors | Yulu Jia, Piotr Luszczek, Jack Dongarra |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.05.2016 |
DOI | 10.1109/IPDPSW.2016.34 |
Abstract | Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction. |
---|---|
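The abstract names algorithm-based fault tolerance (ABFT) as one of the techniques used to detect and correct soft errors. The following is a minimal sketch of the general checksum idea only, not the paper's implementation: the function names `checksums` and `detect_and_correct` are hypothetical, the matrix is a plain Python list of lists, and the sketch assumes at most one corrupted entry, whereas the paper targets simultaneous errors during a GPU-accelerated Hessenberg reduction.

```python
# Illustrative ABFT sketch (an assumption-laden example, not the paper's code):
# protect a dense matrix with row and column checksums, then locate and repair
# a single corrupted entry by intersecting the mismatching row and column.

def checksums(a):
    """Return (row_sums, col_sums) for a rectangular list-of-lists matrix."""
    row_sums = [sum(row) for row in a]
    col_sums = [sum(col) for col in zip(*a)]
    return row_sums, col_sums


def detect_and_correct(a, row_sums, col_sums, tol=1e-9):
    """Repair at most one corrupted entry of `a` in place.

    Returns the (i, j) position that was repaired, or None when the matrix
    agrees with the stored checksums. Raises ValueError when the mismatch
    pattern does not fit the single-error model assumed by this sketch.
    """
    new_rows, new_cols = checksums(a)
    bad_rows = [i for i, (s, t) in enumerate(zip(new_rows, row_sums))
                if abs(s - t) > tol]
    bad_cols = [j for j, (s, t) in enumerate(zip(new_cols, col_sums))
                if abs(s - t) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        raise ValueError("mismatch pattern is outside the single-error model")
    i, j = bad_rows[0], bad_cols[0]
    # The row checksum difference equals the magnitude of the injected error.
    a[i][j] -= new_rows[i] - row_sums[i]
    return i, j
```

To try it, compute the checksums of a small matrix, perturb one entry to simulate a soft error, and call `detect_and_correct` with the stored checksums. In a real ABFT scheme the checksums would be carried through each update of the factorization rather than computed once on a static matrix.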
Author | Luszczek, Piotr; Yulu Jia; Dongarra, Jack |
Author_xml | – sequence: 1 surname: Yulu Jia fullname: Yulu Jia organization: Univ. of Tennessee, Knoxville, TN, USA – sequence: 2 givenname: Piotr surname: Luszczek fullname: Luszczek, Piotr organization: Univ. of Tennessee, Knoxville, TN, USA – sequence: 3 givenname: Jack surname: Dongarra fullname: Dongarra, Jack organization: Univ. of Tennessee, Knoxville, TN, USA |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
EISBN | 9781509036820 1509036822 |
EndPage | 662 |
ExternalDocumentID | 7529926 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
PageCount | 10 |
PublicationCentury | 2000 |
PublicationDate | 20160501 |
PublicationDateYYYYMMDD | 2016-05-01 |
PublicationDate_xml | – month: 05 year: 2016 text: 20160501 day: 01 |
PublicationDecade | 2010 |
PublicationTitle | 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW) |
PublicationTitleAbbrev | IPDPSW |
PublicationYear | 2016 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
StartPage | 653 |
SubjectTerms | Algorithm design and analysis; Central Processing Unit; Fault tolerance; Fault tolerant systems; GPGPU; Graphics processing units; Hessenberg reduction; similarity transformation; Transient analysis; Transistors |
Title | Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures |
URI | https://ieeexplore.ieee.org/document/7529926 |