A faster checkpointing and recovery algorithm with a hierarchical storage approach

Bibliographic Details
Published in Eighth International Conference on High Performance Computing in Asia Pacific Region : proceedings : 30 November - 3 December, Beijing, China pp. 5 pp. - 402
Main Authors Gao, Wen; Mingyu Chen; Nanya, T.
Format Conference Proceeding
Language English
Published IEEE 2005
ISBN 9780769524863
0769524869
DOI 10.1109/HPCASIA.2005.2

Abstract Fault tolerance is an essential part of a cluster operating system. The Score cluster system provides coordinated checkpointing, a rollback recovery mechanism, and a watchdog timer detector for fault tolerance. In Score's checkpointing algorithm, the disk write is the bottleneck. To eliminate the disk write overhead, this paper proposes a new diskless checkpointing and rollback recovery algorithm. Because the proposed algorithm does not need to calculate parity or write the checkpoint data to disk, analysis shows it to be faster than the original checkpointing algorithm. A comparison shows that the recovery time of the proposed algorithm is also shorter. However, with this diskless checkpointing algorithm alone, the cluster cannot tolerate multiple transient failures. To compensate for this drawback, a hierarchical storage strategy is adopted. Experimental results show that the diskless algorithm with a hierarchical storage approach is fast and effective.
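The abstract does not spell out the mechanism, so the following is only a minimal sketch of the general idea under stated assumptions, not the authors' algorithm. It assumes that each node mirrors its in-memory checkpoint to one neighbor (no parity, no disk write on the fast path) and that every few rounds the checkpoint is also flushed to a slower disk tier, which is used only when more than one node fails. The names `Node`, `Cluster`, `disk_every`, and the ring-neighbor layout are all hypothetical.

```python
import copy


class Node:
    """One cluster node; all fields live in volatile memory."""

    def __init__(self, node_id):
        self.node_id = node_id
        self.state = {}         # live application state
        self.local_ckpt = None  # this node's own latest checkpoint
        self.buddy_ckpt = None  # mirror of the previous node's checkpoint


class Cluster:
    """Toy model: memory tier (neighbor mirror) plus a slower disk tier."""

    def __init__(self, n, disk_every=3):
        assert n >= 2, "mirroring needs at least two nodes"
        self.nodes = [Node(i) for i in range(n)]
        self.disk = {}                # stands in for stable storage
        self.disk_every = disk_every  # assumed flush interval, in rounds
        self.round = 0

    def _mirror(self):
        # Each node keeps a copy of its ring predecessor's checkpoint.
        n = len(self.nodes)
        for node in self.nodes:
            succ = self.nodes[(node.node_id + 1) % n]
            succ.buddy_ckpt = copy.deepcopy(node.local_ckpt)

    def checkpoint(self):
        """Coordinated checkpoint: always to memory, occasionally to disk."""
        self.round += 1
        for node in self.nodes:
            node.local_ckpt = copy.deepcopy(node.state)
        self._mirror()
        if self.round % self.disk_every == 0:
            self.disk = {x.node_id: copy.deepcopy(x.local_ckpt)
                         for x in self.nodes}

    def fail(self, node_id):
        """Transient failure: the node loses everything held in memory."""
        node = self.nodes[node_id]
        node.state, node.local_ckpt, node.buddy_ckpt = {}, None, None

    def recover(self, failed_ids):
        """Roll every node back; return which tier supplied the data."""
        if self.round == 0:
            raise RuntimeError("no checkpoint has been taken yet")
        n = len(self.nodes)
        if len(failed_ids) == 1:
            # Single failure: fetch the lost checkpoint from the neighbor.
            i = failed_ids[0]
            mirror = self.nodes[(i + 1) % n].buddy_ckpt
            self.nodes[i].local_ckpt = copy.deepcopy(mirror)
            tier = "memory"
        elif self.disk:
            # Multiple failures: fall back to the older on-disk checkpoint.
            for node in self.nodes:
                node.local_ckpt = copy.deepcopy(self.disk[node.node_id])
            tier = "disk"
        else:
            raise RuntimeError("multiple failures and no disk checkpoint")
        for node in self.nodes:
            node.state = copy.deepcopy(node.local_ckpt)
        self._mirror()  # rebuild mirrors lost on the failed nodes
        return tier
```

In this sketch a single failure is recovered from a neighbor's memory, while a double failure rolls back further, to the last round that reached disk. That is the trade-off a hierarchical scheme of this kind makes: the frequent path avoids the disk, and the rare path pays for it with an older checkpoint.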
Author_xml – sequence: 1
  givenname: Wen
  surname: Gao
  fullname: Gao, Wen
  email: gw@ncic.ac.cn
  organization: Inst. of Comput. Technol., Chinese Acad. of Sci., China
– sequence: 2
  surname: Mingyu Chen
  fullname: Mingyu Chen
  email: cmy@ncic.ac.cn
  organization: Inst. of Comput. Technol., Chinese Acad. of Sci., China
– sequence: 3
  givenname: T.
  surname: Nanya
  fullname: Nanya, T.
  email: nanya@hal.rcast.utokyo.ac.jp
SubjectTerms Algorithm design and analysis
Checkpointing
Clustering algorithms
Computers
Concurrent computing
Detectors
Fault detection
Fault tolerant systems
High performance computing
Operating systems
URI https://ieeexplore.ieee.org/document/1592295