A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing
Diskless checkpointing is an efficient technique to save the state of a long running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive k failures...
Published in | 2008 11th IEEE High Assurance Systems Engineering Symposium, pp. 71-79 |
---|---|
Main Authors | Zizhong Chen, J. Dongarra |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.12.2008 |
Subjects | |
ISBN | 0769534821 9780769534824 |
ISSN | 1530-2059 |
DOI | 10.1109/HASE.2008.13 |
Abstract | Diskless checkpointing is an efficient technique to save the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive $k$ failures in $p$ processes from $2\lceil \log p \rceil \cdot k((\beta + 2\gamma)m + \alpha)$ to $(1 + O(1/\sqrt{m})) \cdot k(\beta + 2\gamma)m$, where $\alpha$ is the communication latency, $1/\beta$ is the network bandwidth between processes, $1/\gamma$ is the rate to perform calculations, and $m$ is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive $k$ failures in $p$ processes does not increase as the number of processes $p$ increases. We evaluate the performance overhead of the introduced algorithm by using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable. |
---|---|
Author | Zizhong Chen; Dongarra, J. |
Author_xml | – sequence: 1 surname: Zizhong Chen fullname: Zizhong Chen organization: Dept. of Math. & Comput. Sci. Golden, Colorado Sch. of Mines, Golden, CO – sequence: 2 givenname: J. surname: Dongarra fullname: Dongarra, J. organization: Dept. of Electr. Eng. & Comput. Sci. Knoxville, Univ. of Tennessee, Knoxville, TN |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/HASE.2008.13 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Xplore POP ALL IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Engineering Computer Science |
EndPage | 79 |
ExternalDocumentID | 4708865 |
Genre | orig-research |
IEDL.DBID | RIE |
ISBN | 0769534821 9780769534824 |
ISSN | 1530-2059 |
IngestDate | Wed Aug 27 02:12:22 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
PageCount | 9 |
ParticipantIDs | ieee_primary_4708865 |
PublicationCentury | 2000 |
PublicationDate | 2008-Dec. |
PublicationDateYYYYMMDD | 2008-12-01 |
PublicationDate_xml | – month: 12 year: 2008 text: 2008-Dec. |
PublicationDecade | 2000 |
PublicationTitle | 2008 11th IEEE High Assurance Systems Engineering Symposium |
PublicationTitleAbbrev | HASE |
PublicationYear | 2008 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0001967031 ssj0008135 |
Score | 1.7870263 |
Snippet | Diskless checkpointing is an efficient technique to save the state of a long running application in a distributed environment without relying on stable... |
SourceID | ieee |
SourceType | Publisher |
StartPage | 71 |
SubjectTerms | Bandwidth; Checkpoint; Checkpointing; Contracts; Delay; diskless checkpointing; Encoding; Fault tolerance; High performance computing; parallel and distributed systems; Reed-Solomon encoding; Scalability; Systems engineering and theory; USA Councils |
Title | A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing |
URI | https://ieeexplore.ieee.org/document/4708865 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
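
The two overhead expressions quoted in the abstract can be compared numerically. The following is a minimal sketch, not the authors' code: it assumes the logarithm is base 2, and it models the unspecified $O(1/\sqrt{m})$ term as `c / sqrt(m)` with a hypothetical constant `c`. The function names and the sample parameter values are illustrative assumptions only.

```python
import math


def baseline_overhead(p, k, m, alpha, beta, gamma):
    """Baseline cost from the abstract: 2*ceil(log p) * k * ((beta + 2*gamma)*m + alpha).

    Assumption: log is base 2. (p - 1).bit_length() equals ceil(log2(p)) for p >= 1.
    """
    steps = (p - 1).bit_length()
    return 2 * steps * k * ((beta + 2 * gamma) * m + alpha)


def scalable_overhead(k, m, beta, gamma, c=1.0):
    """Reduced cost from the abstract: (1 + O(1/sqrt(m))) * k * (beta + 2*gamma) * m.

    Assumption: the O(1/sqrt(m)) term is modeled as c / sqrt(m); c is hypothetical.
    Note that the result does not depend on the process count p.
    """
    return (1 + c / math.sqrt(m)) * k * (beta + 2 * gamma) * m


if __name__ == "__main__":
    # Illustrative values only (not taken from the paper):
    alpha = 1e-5   # communication latency, seconds
    beta = 1e-9    # seconds per byte (inverse bandwidth)
    gamma = 1e-10  # seconds per byte of computation
    m = 10**8      # local checkpoint size per process, bytes
    k = 2          # number of failures to survive
    for p in (64, 1024, 16384):
        base = baseline_overhead(p, k, m, alpha, beta, gamma)
        new = scalable_overhead(k, m, beta, gamma)
        print(f"p={p:6d}  baseline={base:.4f}s  scalable={new:.4f}s")
```

Under these assumptions the baseline grows with $\lceil \log p \rceil$ while the reduced expression stays constant as $p$ increases, which is the sense of "scalable" used in the abstract.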