Checkpointing strategies for a fixed-length execution

Bibliographic Details
Published in: SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 508-518
Main Authors: Benoit, Anne; Perotin, Lucas; Robert, Yves; Vivien, Frederic
Format Conference Proceeding
Language English
Published IEEE 17.11.2024
Summary: This work considers checkpointing strategies for a parallel application executing on a large-scale platform whose nodes are subject to failures. The application executes for a fixed duration, namely the length of the reservation that it has been granted. We start with small examples that show the difficulty of the problem: it turns out that the optimal checkpointing strategy neither always uses periodic checkpoints nor always takes its last checkpoint exactly at the end of the reservation. Then, we introduce a dynamic heuristic that is periodic and chooses the checkpointing frequency based on thresholds for the time left; we determine threshold times T_n such that it is best to plan for exactly n checkpoints if the time left (or initially the length of the reservation) is between T_n and T_{n+1}. Next, we use time discretization and design a (complicated) dynamic programming algorithm that computes the optimal solution, without any restriction on the checkpointing strategy. Finally, we report the results of an extensive simulation campaign that shows that the optimal solution is far more efficient than the Young/Daly periodic approach for short or mid-size reservations.
DOI: 10.1109/SCW63240.2024.00072
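
Illustration (not from the paper): the Young/Daly periodic approach that the summary uses as its baseline can be sketched as below. This is a minimal sketch that assumes the usual first-order formula, a work interval of sqrt(2 * MTBF * C) between checkpoints of cost C, and it ignores failures and recovery entirely. The function names, parameters, and sample values are hypothetical. The paper's thresholds T_n and its dynamic programming algorithm are not given in this record and are not reproduced here.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly work interval between checkpoints: sqrt(2 * mtbf * C)."""
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def periodic_plan(reservation: float, mtbf: float, checkpoint_cost: float) -> list[float]:
    """Completion times of the periodic checkpoints that fit inside the reservation.

    Assumption for illustration only: a failure-free run, each segment being
    one work interval followed by one checkpoint.
    """
    segment = young_daly_period(mtbf, checkpoint_cost) + checkpoint_cost
    count = int(reservation // segment)
    return [segment * (i + 1) for i in range(count)]


# Hypothetical values: MTBF of 50, checkpoint cost of 1, reservation of 35 time units.
plan = periodic_plan(reservation=35.0, mtbf=50.0, checkpoint_cost=1.0)
unsaved_tail = 35.0 - plan[-1]
```

With these sample values the last periodic checkpoint completes at time 33, which leaves 2 time units at the end of the reservation with no checkpoint after them. If work that is not saved when the reservation ends does not count (an assumption here), this tail shows informally why a purely periodic plan may be a poor fit for a fixed-length execution.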