A Scalable Checkpoint Encoding Algorithm for Diskless Checkpointing


Bibliographic Details
Published in: 2008 11th IEEE High Assurance Systems Engineering Symposium, pp. 71-79
Main Authors: Zizhong Chen; J. Dongarra
Format: Conference Proceeding
Language: English
Published: IEEE, 2008-12-01
ISBN: 0769534821; 9780769534824
ISSN: 1530-2059
DOI: 10.1109/HASE.2008.13

Abstract: Diskless checkpointing is an efficient technique for saving the state of a long-running application in a distributed environment without relying on stable storage. In this paper, we introduce several scalable encoding strategies into diskless checkpointing and reduce the overhead to survive $k$ failures in $p$ processes from $2\lceil \log p \rceil \cdot k\,((\beta + 2\gamma)m + \alpha)$ to $(1 + O(1/\sqrt{m})) \cdot k\,(\beta + 2\gamma)m$, where $\alpha$ is the communication latency, $1/\beta$ is the network bandwidth between processes, $1/\gamma$ is the rate at which calculations are performed, and $m$ is the size of the local checkpoint per process. The introduced algorithm is scalable in the sense that the overhead to survive $k$ failures in $p$ processes does not increase as the number of processes $p$ increases. We evaluate the performance overhead of the introduced algorithm using a preconditioned conjugate gradient equation solver as an example. Experimental results demonstrate that the introduced techniques are highly scalable.
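The two overhead expressions in the abstract can be compared numerically. The following is a minimal sketch, not code from the paper: the function names, the base-2 logarithm, the constant `c` standing in for the hidden factor in the $O(1/\sqrt{m})$ term, and all parameter values are assumptions chosen only for illustration.

```python
import math


def tree_overhead(p, k, m, alpha, beta, gamma):
    """Baseline overhead from the abstract:
    2 * ceil(log p) * k * ((beta + 2*gamma) * m + alpha).
    The logarithm base is assumed to be 2 (not stated in the abstract)."""
    return 2 * math.ceil(math.log2(p)) * k * ((beta + 2 * gamma) * m + alpha)


def scalable_overhead(k, m, beta, gamma, c=1.0):
    """Improved overhead from the abstract:
    (1 + O(1/sqrt(m))) * k * (beta + 2*gamma) * m.
    c is a hypothetical constant replacing the unspecified O(.) factor.
    Note that p does not appear in this expression."""
    return (1 + c / math.sqrt(m)) * k * (beta + 2 * gamma) * m


if __name__ == "__main__":
    # Illustrative values only; none of these numbers come from the paper.
    alpha, beta, gamma = 1e-5, 1e-9, 1e-10
    k, m = 2, 10**8
    for p in (64, 1024, 16384):
        base = tree_overhead(p, k, m, alpha, beta, gamma)
        new = scalable_overhead(k, m, beta, gamma)
        print(f"p={p:6d}  baseline={base:.4f}  scalable={new:.4f}")
```

Under these assumptions the baseline figure grows with $\lceil \log p \rceil$, while the second expression stays fixed as $p$ changes, which is the sense of "scalable" used in the abstract.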
Author Affiliations:
Zizhong Chen, Dept. of Math. & Comput. Sci., Colorado Sch. of Mines, Golden, CO
J. Dongarra, Dept. of Electr. Eng. & Comput. Sci., Univ. of Tennessee, Knoxville, TN
Discipline: Engineering; Computer Science
Subject Terms: Bandwidth; Checkpoint; Checkpointing; Contracts; Delay; diskless checkpointing; Encoding; Fault tolerance; High performance computing; parallel and distributed systems; Reed-Solomon encoding; Scalability; Systems engineering and theory; USA Councils
URI: https://ieeexplore.ieee.org/document/4708865