A Lightweight Causal Message Logging Protocol to Lower Fault Tolerance Overhead

Bibliographic Details
Published in 2016 IEEE International Conference on Cluster Computing (CLUSTER), pp. 392-401
Main Author Jin-Min Yang
Format Conference Proceeding
Language English
Published IEEE 01.09.2016
Summary: Rollback recovery is a dependable and key approach to fault tolerance in high-performance computing and to parallel program debugging. Among rollback recovery protocols, causal message logging has several desirable characteristics, but its high piggybacking overhead hinders its adoption, especially in large-scale distributed systems. This overhead arises from its conservative assumptions about the program execution model. This paper identifies how non-deterministic message delivery influences the correct outcome of a process, and then gives a scheme to relax the constraints imposed by the piecewise deterministic execution model. Subsequently, a lightweight implementation of causal message logging is proposed to decrease the overhead of piggybacking and rolling forward. Experimental results on three NAS NPB2.3 benchmarks show that the proposed scheme significantly reduces this overhead.
ISSN: 2168-9253
DOI: 10.1109/CLUSTER.2016.64