Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems

Bibliographic Details
Published in: Parallel Computational Fluid Dynamics, pp. 470-481
Main Authors: Shi, Lin; Chen, Hao; Li, Ting
Format: Book Chapter
Language: English
Published: Berlin, Heidelberg: Springer Berlin Heidelberg, 2014
Series: Communications in Computer and Information Science
ISBN: 3642539610, 9783642539619
ISSN: 1865-0929, 1865-0937
DOI: 10.1007/978-3-642-53962-6_42

Summary: Fault tolerance has become a major concern in exascale computing, especially for large-scale CPU/GPU heterogeneous clusters. The performance/cost benefit of GPU-based systems depends on their ability to provide high reliability, availability, and serviceability. Traditional CPU-based checkpoint technologies have been deployed on GPU platforms, but all of them treat the GPU as a second-class controllable and shared entity. Because existing GPU checkpoint/restart implementations do not support checkpointing the internal GPU state, code running on the GPU (a kernel) cannot be checkpointed and restored the way CPU code can; all checkpoint operations are performed outside the kernel. In this paper, we propose a hybrid checkpoint technology, HKC (Hybrid Kernel Checkpoint). HKC combines PTX stub injection with a dynamic library hijacking mechanism to save and restore the internal state of a GPU kernel. Our evaluation shows that HKC increases the reliability of CPU/GPU hybrid systems at a reasonable cost and is more resilient than other checkpoint schemes.