Hybrid CPU/GPU Checkpoint for GPU-Based Heterogeneous Systems
Published in | Parallel Computational Fluid Dynamics, pp. 470-481 |
---|---|
Main Authors | , , |
Format | Book Chapter |
Language | English |
Published | Berlin, Heidelberg: Springer Berlin Heidelberg, 2014 |
Series | Communications in Computer and Information Science |
ISBN | 3642539610 9783642539619 |
ISSN | 1865-0929 1865-0937 |
DOI | 10.1007/978-3-642-53962-6_42 |
Summary: | Fault tolerance has become a major concern in exascale computing, especially for large-scale CPU/GPU heterogeneous clusters. The performance/cost benefit of GPU-based systems depends on their ability to provide high reliability, availability, and serviceability. Traditional CPU-based checkpoint technologies have been deployed on the GPU platform, but all of them treat the GPU as a second-class controllable and shared entity. Because existing GPU checkpoint/restart implementations do not support checkpointing the internal GPU status, code running on the GPU (the kernel) cannot be checkpointed and restored the way CPU code can; all checkpoint operations are done outside the kernel. In this paper, we propose a hybrid checkpoint technology, HKC (Hybrid Kernel Checkpoint). HKC combines PTX stub injection with a dynamic library hijacking mechanism to save and restore the internal state of a GPU kernel. Our evaluation shows that HKC increases the reliability of CPU/GPU hybrid systems at a reasonable cost and shows more resilience than other checkpoint schemes. |
---|---|