A Lightweight Causal Message Logging Protocol to Lower Fault Tolerance Overhead
Rollback recovery is a trustworthy and key approach to fault tolerance in high performance computing and to parallel program debugging. In various rollback recovery protocols, causal message logging shows some desirable characteristics, but its high piggybacking overhead obstructs its applications,...
Saved in:
Published in | 2016 IEEE International Conference on Cluster Computing (CLUSTER) pp. 392 - 401 |
---|---|
Main Author | |
Format | Conference Proceeding |
Language | English |
Published |
IEEE
01.09.2016
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Rollback recovery is a trustworthy and key approach to fault tolerance in high performance computing and to parallel program debugging. In various rollback recovery protocols, causal message logging shows some desirable characteristics, but its high piggybacking overhead obstructs its applications, especially in large-scale distributed systems. Its high overhead arises from its conservation in the assumption on program execution model. This paper identifies the influence of non-deterministic message delivery on the correct outcome of a process, and then gives a scheme to relax the constraints from the piecewise deterministic execution model. Subsequently, a lightweight implementation of causal message logging is proposed to decrease the overhead of piggybacking and rolling forward. The experimental results of 3 NAS NPB2.3 benchmarks show that the proposed scheme achieves a significant improvement in the overhead reduction. |
---|---|
ISSN: | 2168-9253 |
DOI: | 10.1109/CLUSTER.2016.64 |