Checkpointing strategies for a fixed-length execution


Bibliographic Details
Published in SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 508-518
Main Authors Benoit, Anne, Perotin, Lucas, Robert, Yves, Vivien, Frederic
Format Conference Proceeding
Language English
Published IEEE 17.11.2024
DOI 10.1109/SCW63240.2024.00072

Abstract This work considers checkpointing strategies for a parallel application executing on a large-scale platform whose nodes are subject to failures. The application executes for a fixed duration, namely the length of the reservation that it has been granted. We start with small examples that show the difficulty of the problem: it turns out that the optimal checkpointing strategy neither always uses periodic checkpoints nor always takes its last checkpoint exactly at the end of the reservation. Then, we introduce a dynamic heuristic that is periodic and decides for the checkpointing frequency based upon thresholds for the time left; we determine threshold times T_n such that it is best to plan for exactly n checkpoints if the time left (or initially the length of the reservation) is between T_n and T_{n+1}. Next, we use time discretization and design a (complicated) dynamic programming algorithm that computes the optimal solution, without any restriction on the checkpointing strategy. Finally, we report the results of an extensive simulation campaign that shows that the optimal solution is far more efficient than the Young/Daly periodic approach for short or mid-size reservations.
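For context, the Young/Daly periodic baseline named in the abstract is commonly stated as a first-order work interval W = sqrt(2 * mu * C), where mu is the platform mean time between failures and C is the cost of one checkpoint. The sketch below is not taken from the paper: the function names, the parameter names, and the simple rule of dividing the reservation by one work-plus-checkpoint segment are illustrative assumptions, and it implements only the periodic baseline, not the threshold heuristic or the dynamic programming algorithm.

```python
import math


def young_daly_period(mtbf: float, ckpt_cost: float) -> float:
    """First-order Young/Daly work interval between checkpoints.

    mtbf and ckpt_cost are assumed to be in the same time unit (e.g. seconds).
    """
    if mtbf <= 0 or ckpt_cost <= 0:
        raise ValueError("mtbf and ckpt_cost must be positive")
    return math.sqrt(2.0 * mtbf * ckpt_cost)


def periodic_checkpoint_count(reservation: float, mtbf: float, ckpt_cost: float) -> int:
    """Hypothetical helper: how many full (work + checkpoint) segments
    fit in a failure-free reservation when using the Young/Daly period."""
    if reservation < 0:
        raise ValueError("reservation must be non-negative")
    segment = young_daly_period(mtbf, ckpt_cost) + ckpt_cost
    return int(reservation // segment)
```

With an assumed MTBF of 7200 s and a 100 s checkpoint, the period is 1200 s, so a 10000 s reservation holds seven full segments and leaves 900 s unused. That leftover time illustrates why a purely periodic plan may be a poor fit for short reservations.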
Author_xml – sequence: 1
  givenname: Anne
  surname: Benoit
  fullname: Benoit, Anne
  email: anne.benoit@ens-lyon.fr
  organization: Institut Universitaire de France, Laboratoire LIP, ENS Lyon & Inria, Lyon, France
– sequence: 2
  givenname: Lucas
  surname: Perotin
  fullname: Perotin, Lucas
  email: lucas.perotin@ens-lyon.fr
  organization: Vanderbilt University, Laboratoire LIP, ENS Lyon & Inria, Lyon, France, TN, USA
– sequence: 3
  givenname: Yves
  surname: Robert
  fullname: Robert, Yves
  email: yves.robert@ens-lyon.fr
  organization: Laboratoire LIP, ENS Lyon & Inria, Lyon, France
– sequence: 4
  givenname: Frederic
  surname: Vivien
  fullname: Vivien, Frederic
  email: frederic.vivien@inria.fr
  organization: Laboratoire LIP, ENS Lyon & Inria, Lyon, France
CODEN IEEPAD
EISBN 9798350355543
OpenAccessLink https://inria.hal.science/hal-04668191/document
SubjectTerms Checkpointing
Conferences
Dynamic programming
Error analysis
fail-stop errors
failures
fixed-size reservation
fixed-length execution
Heuristic algorithms
High performance computing
Linear algebra
Time-frequency analysis
URI https://ieeexplore.ieee.org/document/10820581