Checkpointing strategies for a fixed-length execution
Published in | SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis pp. 508 - 518 |
---|---|
Main Authors | Benoit, Anne; Perotin, Lucas; Robert, Yves; Vivien, Frederic |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 17.11.2024 |
DOI | 10.1109/SCW63240.2024.00072 |
Abstract | This work considers checkpointing strategies for a parallel application executing on a large-scale platform whose nodes are subject to failures. The application executes for a fixed duration, namely the length of the reservation that it has been granted. We start with small examples that show the difficulty of the problem: the optimal checkpointing strategy does not always use periodic checkpoints, nor does it always take its last checkpoint exactly at the end of the reservation. Then, we introduce a dynamic heuristic that is periodic and chooses the checkpointing frequency based on thresholds for the time left: we determine threshold times T_n such that it is best to plan for exactly n checkpoints if the time left (or, initially, the length of the reservation) is between T_n and T_{n+1}. Next, we use time discretization and design a (complicated) dynamic programming algorithm that computes the optimal solution, without any restriction on the checkpointing strategy. Finally, we report the results of an extensive simulation campaign showing that the optimal solution is far more efficient than the Young/Daly periodic approach for short or mid-size reservations. |
Author | Benoit, Anne; Robert, Yves; Perotin, Lucas; Vivien, Frederic |
Author_xml | 1. Benoit, Anne (anne.benoit@ens-lyon.fr), Institut Universitaire de France, Laboratoire LIP, ENS Lyon & Inria, Lyon, France; 2. Perotin, Lucas (lucas.perotin@ens-lyon.fr), Vanderbilt University, Laboratoire LIP, ENS Lyon & Inria, Lyon, France, TN, USA; 3. Robert, Yves (yves.robert@ens-lyon.fr), Laboratoire LIP, ENS Lyon & Inria, Lyon, France; 4. Vivien, Frederic (frederic.vivien@inria.fr), Laboratoire LIP, ENS Lyon & Inria, Lyon, France |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
EISBN | 9798350355543 |
EndPage | 518 |
ExternalDocumentID | 10820581 |
Genre | orig-research |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
OpenAccessLink | https://inria.hal.science/hal-04668191/document |
PageCount | 11 |
PublicationDate | 2024-11-17 |
PublicationTitle | SC24-W: Workshops of the International Conference for High Performance Computing, Networking, Storage and Analysis |
PublicationTitleAbbrev | SC-W |
PublicationYear | 2024 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
StartPage | 508 |
SubjectTerms | Checkpointing Conferences Dynamic programming Error analysis fail-stop errors failures fixed-size reservation fixedlength execution Heuristic algorithms High performance computing Linear algebra Time-frequency analysis |
Title | Checkpointing strategies for a fixed-length execution |
URI | https://ieeexplore.ieee.org/document/10820581 |
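
The abstract compares its optimal solution against the Young/Daly periodic approach. As a rough illustration of that baseline only, the sketch below uses the classical first-order formula W = sqrt(2 * mu * C), where mu is the platform mean time between failures and C is the checkpoint cost. The function names, parameter names, and numeric values are assumptions made for this example; they do not come from the paper, and the code implements neither the threshold heuristic nor the dynamic programming algorithm that the abstract describes.

```python
import math


def young_daly_period(mtbf: float, checkpoint_cost: float) -> float:
    """First-order Young/Daly work interval between checkpoints: sqrt(2 * mtbf * C)."""
    if mtbf <= 0 or checkpoint_cost <= 0:
        raise ValueError("mtbf and checkpoint_cost must be positive")
    return math.sqrt(2.0 * mtbf * checkpoint_cost)


def periodic_checkpoints_in_reservation(
    reservation: float, mtbf: float, checkpoint_cost: float
) -> int:
    """Count full (work + checkpoint) segments that fit in a failure-free reservation."""
    # Each periodic segment is one work interval followed by one checkpoint.
    segment = young_daly_period(mtbf, checkpoint_cost) + checkpoint_cost
    return int(reservation // segment)


if __name__ == "__main__":
    # Illustrative values in seconds (hypothetical, not taken from the paper).
    mtbf, cost, reservation = 7200.0, 100.0, 4000.0
    print(young_daly_period(mtbf, cost))
    print(periodic_checkpoints_in_reservation(reservation, mtbf, cost))
```

With these illustrative values, whatever time remains after the last full segment is not covered by any checkpoint, which is one way to see why a fixed-length reservation may call for something other than a purely periodic schedule.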