To repair or not to repair: Assessing fault resilience in MPI stencil applications

Bibliographic Details
Published in: Journal of parallel and distributed computing, Vol. 205; p. 105156
Main Authors: Rocco, Roberto; Boella, Elisabetta; Gregori, Daniele; Palermo, Gianluca
Format: Journal Article
Language: English
Published: Elsevier Inc, 01.11.2025
Summary: With the increasing size of HPC computations, faults are becoming more and more relevant in the HPC field. The MPI standard does not define the application behaviour after a fault, leaving the burden of fault management to the user, who usually resorts to checkpoint and restart mechanisms. This trend is especially true in stencil applications, as their regular pattern simplifies the selection of checkpoint locations. However, checkpoint and restart mechanisms introduce non-negligible overhead, disk load, and scalability concerns. In this paper, we show an alternative through fault resilience, enabled by the features provided by the User Level Fault Mitigation extension and shipped within the Legio fault resilience framework. Through fault resilience, we continue executing only the non-failed processes, thus sacrificing result accuracy for faster fault recovery. Our experiments on some specimen stencil applications show that, despite the fault impact visible in the result, we produced meaningful values usable for scientific research, proving the possibilities of a fault resilience approach in a stencil scenario.
• Faults are becoming a critical issue in HPC executions, as MPI cannot handle them.
• Checkpointing, while widespread, is time-consuming, disk demanding and poorly scalable.
• Through fault resilience, we sacrifice result accuracy for faster recovery.
• Experiments show that the loss of accuracy does not compromise result usability.
ISSN: 0743-7315
DOI: 10.1016/j.jpdc.2025.105156
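
Illustrative sketch (not from the article): the trade-off described in the summary, continuing with only the non-failed processes at the cost of result accuracy, can be pictured with a toy simulation. The Python code below does not use MPI, the User Level Fault Mitigation extension, or Legio, and it is not the authors' code. It assumes a 1D Jacobi heat-diffusion stencil whose cells are split evenly across simulated "ranks"; the function names, grid size, rank count, and failure step are all hypothetical choices made for illustration. When a rank "fails", its cells simply stop being updated while the rest of the grid keeps iterating.

```python
def run_stencil(n_cells=64, n_ranks=4, steps=200, failed_rank=None, fail_step=None):
    """Toy 1D Jacobi heat diffusion with fixed boundaries.

    Cells are split evenly across simulated ranks (an assumption of this
    sketch). If failed_rank is set, that rank's cells are frozen from
    iteration fail_step onwards, while all other cells keep updating.
    """
    grid = [0.0] * n_cells
    grid[0] = 1.0  # hot left boundary; the right boundary stays at 0.0
    chunk = n_cells // n_ranks
    for step in range(steps):
        new = grid[:]
        for i in range(1, n_cells - 1):
            owner = i // chunk
            failed = failed_rank is not None and owner == failed_rank
            if failed and step >= fail_step:
                continue  # no recovery: the lost subdomain keeps its last values
            new[i] = 0.5 * (grid[i - 1] + grid[i + 1])
        grid = new
    return grid


def max_abs_error(a, b):
    """Largest pointwise deviation between two grids of equal length."""
    return max(abs(x - y) for x, y in zip(a, b))


# Fault-free reference run versus a run that loses rank 2 halfway through.
reference = run_stencil()
degraded = run_stencil(failed_rank=2, fail_step=100)
error = max_abs_error(reference, degraded)
print(f"max abs deviation from the fault-free run: {error:.6f}")
```

In this toy setting the degraded run finishes without any restart, and the deviation from the fault-free result gives a rough measure of the accuracy given up. Whether such a deviation is acceptable depends on the application, which is the question the article's experiments address.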