An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL

Bibliographic Details
Published in	IEEE Transactions on Parallel and Distributed Systems, Vol. 29, no. 4, pp. 758-771
Main Authors	Tang, Xiongchao; Zhai, Jidong; Yu, Bowen; Chen, Wenguang; Zheng, Weimin; Li, Keqin
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.04.2018
Subjects	Computer memory; Encoding; Fault tolerance; Fault tolerant systems; fault-tolerant HPL; in-memory checkpoint; Large-scale systems; memory consumption; Memory management; Methods; Random access memory; Servers; State of the art; System reliability
ISSN	1045-9219 1558-2183
DOI	10.1109/TPDS.2017.2781257

Summary:	Fault tolerance is increasingly important in high-performance computing due to the substantial growth of system scale and decreasing system reliability. In-memory/diskless checkpoint has gained extensive attention as a solution to avoid the IO bottleneck of traditional disk-based checkpoint methods. However, applications using previous in-memory checkpoint methods are left with little available memory. To provide high reliability, previous in-memory checkpoint methods either keep two copies of checkpoints, so that a failure during a checkpoint update can be tolerated, or trade performance for space by flushing in-memory checkpoints to disk. In this paper, we propose a novel in-memory checkpoint method, called self-checkpoint, which not only achieves the same reliability as previous in-memory checkpoint methods but also increases the memory available to applications by almost 50 percent. To validate our method, we apply the self-checkpoint method to an important problem: High-Performance Linpack (HPL) with fault tolerance. We implement a scalable and fault-tolerant HPL based on this new method, called SKT-HPL, and validate it on two large-scale systems. Experimental results with 24,576 processes show that SKT-HPL achieves over 95 percent of the performance of the original HPL. Compared to the state-of-the-art in-memory checkpoint method, SKT-HPL improves the available memory size by 47 percent and the performance by 5 percent.
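Illustration:	The summary notes that earlier in-memory checkpoint methods keep two checkpoint copies so that a failure during an update can be tolerated. The following is a minimal single-machine sketch of that general idea only, not the paper's self-checkpoint method and not SKT-HPL. It assumes XOR parity across a group of processes as the encoding, and the names `xor_parity` and `DoubleCheckpointGroup` are hypothetical, introduced purely for illustration.

```python
from functools import reduce


def xor_parity(blocks):
    """XOR equal-length byte blocks together into one parity block."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


class DoubleCheckpointGroup:
    """Hypothetical simulation of a process group with two checkpoint slots.

    A real system would spread the copies and parity over the memory of
    different nodes; here everything lives in one object for clarity.
    """

    def __init__(self, n_procs):
        self.n_procs = n_procs
        # Each slot holds (per-process state copies, parity block) or None.
        self.slots = [None, None]
        self.latest = None  # index of the last fully written slot

    def checkpoint(self, states):
        # Write into the slot that is NOT the latest complete one, so a
        # failure during this update leaves the previous checkpoint intact.
        target = 0 if self.latest != 0 else 1
        self.slots[target] = (list(states), xor_parity(states))
        self.latest = target  # commit only after the slot is fully written

    def recover(self, failed_rank):
        # Rebuild the failed process's state from the surviving copies and
        # the parity block; the failed rank's own copy is treated as lost.
        copies, parity = self.slots[self.latest]
        survivors = [s for r, s in enumerate(copies) if r != failed_rank]
        return xor_parity(survivors + [parity])


group = DoubleCheckpointGroup(3)
group.checkpoint([b"aaaa", b"bbbb", b"cccc"])
restored = group.recover(failed_rank=1)  # rebuilt from ranks 0, 2 and parity
```

Because the second slot exists only to protect the first while it is being overwritten, this scheme holds two full checkpoints in memory at all times, which is the memory cost the summary describes.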