Hessenberg Reduction with Transient Error Resilience on GPU-Based Hybrid Architectures

Bibliographic Details
Published in 2016 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 653-662
Main Authors Jia, Yulu; Luszczek, Piotr; Dongarra, Jack
Format Conference Proceeding
Language English
Published IEEE 01.05.2016
DOI 10.1109/IPDPSW.2016.34

Abstract Graphics Processing Units (GPUs) have been seeing widespread adoption in the field of scientific computing, owing to the performance gains provided on computation-intensive applications. In this paper, we present the design and implementation of a Hessenberg reduction algorithm immune to simultaneous soft-errors, capable of taking advantage of hybrid GPU-CPU platforms. These soft-errors are detected and corrected on the fly, preventing the propagation of the error to the rest of the data. Our design is at the intersection between several fault tolerant techniques and employs the algorithm-based fault tolerance technique, diskless checkpointing, and reverse computation to achieve its goal. By utilizing the idle time of the CPUs, and by overlapping both host-side and GPU-side workloads, we minimize the resilience overhead. Experimental results have validated our design decisions as our algorithm introduced less than 2% performance overhead compared to the optimized, but fault-prone, hybrid Hessenberg reduction.
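As an illustration of the checksum idea behind algorithm-based fault tolerance named in the abstract, the following is a minimal sketch, not the authors' implementation. It assumes a small dense matrix stored as a list of lists, a single corrupted entry, and checksums that are themselves error-free. The function names `checksums` and `detect_and_correct` are hypothetical. It also does not show how checksums are maintained through the Hessenberg transformations, nor the diskless checkpointing and reverse computation the paper combines with them.

```python
def checksums(a):
    """Return (row_sums, col_sums) for a square list-of-lists matrix."""
    row_sums = [sum(row) for row in a]
    col_sums = [sum(col) for col in zip(*a)]
    return row_sums, col_sums


def detect_and_correct(a, row_sums, col_sums, tol=1e-9):
    """Locate and repair at most one corrupted entry of `a` in place.

    Returns the (row, col) index that was repaired, or None if the
    recomputed checksums match the stored ones.
    """
    new_rows, new_cols = checksums(a)
    bad_rows = [i for i, (s, t) in enumerate(zip(new_rows, row_sums))
                if abs(s - t) > tol]
    bad_cols = [j for j, (s, t) in enumerate(zip(new_cols, col_sums))
                if abs(s - t) > tol]
    if not bad_rows and not bad_cols:
        return None
    if len(bad_rows) != 1 or len(bad_cols) != 1:
        # Assumption: this sketch only handles one bad matrix entry.
        raise ValueError("error pattern not correctable by this sketch")
    i, j = bad_rows[0], bad_cols[0]
    # The row-checksum mismatch equals the size of the injected error.
    a[i][j] -= new_rows[i] - row_sums[i]
    return i, j


# Usage: protect a matrix, inject a simulated soft error, then repair it.
a = [[4.0, 1.0, 2.0],
     [3.0, 5.0, 1.0],
     [2.0, 0.0, 6.0]]
rows, cols = checksums(a)
a[1][2] += 8.0  # simulated bit-flip style corruption
located = detect_and_correct(a, rows, cols)
print(located, a[1][2])
```

The mismatching row and column checksums intersect at the corrupted entry, and the size of the mismatch gives the correction. Handling several simultaneous errors, as the abstract claims, would need additional checksums beyond this sketch.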
Author_xml – sequence: 1
  givenname: Yulu
  surname: Jia
  fullname: Jia, Yulu
  organization: Univ. of Tennessee, Knoxville, TN, USA
– sequence: 2
  givenname: Piotr
  surname: Luszczek
  fullname: Luszczek, Piotr
  organization: Univ. of Tennessee, Knoxville, TN, USA
– sequence: 3
  givenname: Jack
  surname: Dongarra
  fullname: Dongarra, Jack
  organization: Univ. of Tennessee, Knoxville, TN, USA
CODEN IEEPAD
EISBN 9781509036820
1509036822
SubjectTerms Algorithm design and analysis
Central Processing Unit
Fault tolerance
Fault tolerant systems
GPGPU
Graphics processing units
Hessenberg reduction
similarity transformation
Transient analysis
Transistors
URI https://ieeexplore.ieee.org/document/7529926