A faster checkpointing and recovery algorithm with a hierarchical storage approach
Published in | Eighth International Conference on High Performance Computing in Asia Pacific Region : proceedings : 30 November - 3 December, Beijing, China pp. 5 pp. - 402 |
---|---|
Main Authors | Gao, Wen; Mingyu Chen; Nanya, T. |
Format | Conference Proceeding |
Language | English |
Published | IEEE 2005 |
ISBN | 9780769524863 0769524869 |
DOI | 10.1109/HPCASIA.2005.2 |
Cover
Abstract | Fault tolerance is an indispensable part of a cluster operating system. The Score cluster system provides coordinated checkpointing, a rollback recovery mechanism, and a watchdog timer detector for fault tolerance. In Score's checkpointing algorithm, the disk write is the bottleneck. To eliminate the disk write overhead, this paper proposes a new diskless checkpointing and rollback recovery algorithm. Because the proposed algorithm neither calculates parity nor writes the checkpoint data to disk, analysis shows it to be faster than the original checkpointing algorithm. A comparison shows that its recovery time is also shorter. However, with this diskless checkpointing algorithm the cluster cannot tolerate multiple transient failures. To compensate for this drawback, a hierarchical storage strategy is adopted. Experimental results show that the diskless algorithm with a hierarchical storage approach is fast and effective. |
---|---|
Author | Mingyu Chen; Gao, Wen; Nanya, T. |
Author_xml | – sequence: 1 givenname: Wen surname: Gao fullname: Gao, Wen email: gw@ncic.ac.cn organization: Inst. of Comput. Technol., Chinese Acad. of Sci., China – sequence: 2 surname: Mingyu Chen fullname: Mingyu Chen email: cmy@ncic.ac.cn organization: Inst. of Comput. Technol., Chinese Acad. of Sci., China – sequence: 3 givenname: T. surname: Nanya fullname: Nanya, T. email: nanya@hal.rcast.utokyo.ac.jp |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/HPCASIA.2005.2 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Xplore POP ALL; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EndPage | 402 |
ExternalDocumentID | 1592295 |
Genre | orig-research |
GroupedDBID | 6IE 6IF 6IK 6IL 6IN AAJGR AARBI AAWTH ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK OCL RIE RIL |
ID | FETCH-LOGICAL-i175t-85ab4b14241b49211fbadd1465004cb28832242e9eb7ed9ffdcfcb54fd64e0fa3 |
IEDL.DBID | RIE |
ISBN | 9780769524863 0769524869 |
IngestDate | Wed Aug 27 02:43:26 EDT 2025 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i175t-85ab4b14241b49211fbadd1465004cb28832242e9eb7ed9ffdcfcb54fd64e0fa3 |
ParticipantIDs | ieee_primary_1592295 |
PublicationCentury | 2000 |
PublicationDate | 20050000 |
PublicationDateYYYYMMDD | 2005-01-01 |
PublicationDate_xml | – year: 2005 text: 20050000 |
PublicationDecade | 2000 |
PublicationTitle | Eighth International Conference on High Performance Computing in Asia Pacific Region : proceedings : 30 November - 3 December, Beijing, China |
PublicationTitleAbbrev | HPCASIA |
PublicationYear | 2005 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0000396038 |
Score | 1.3452364 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 5 pp. |
SubjectTerms | Algorithm design and analysis; Checkpointing; Clustering algorithms; Computers; Concurrent computing; Detectors; Fault detection; Fault tolerant systems; High performance computing; Operating systems |
Title | A faster checkpointing and recovery algorithm with a hierarchical storage approach |
URI | https://ieeexplore.ieee.org/document/1592295 |
hasFullText | 1 |
inHoldings | 1 |