An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL


Bibliographic Details
Published in IEEE transactions on parallel and distributed systems Vol. 29; no. 4; pp. 758-771
Main Authors Tang, Xiongchao, Zhai, Jidong, Yu, Bowen, Chen, Wenguang, Zheng, Weimin, Li, Keqin
Format Journal Article
Language English
Published New York IEEE 01.04.2018
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN 1045-9219
EISSN 1558-2183
DOI 10.1109/TPDS.2017.2781257

Abstract Fault tolerance is increasingly important in high-performance computing as systems grow substantially in scale and become less reliable. In-memory (diskless) checkpointing has gained extensive attention as a way to avoid the I/O bottleneck of traditional disk-based checkpoint methods. However, applications that use previous in-memory checkpoint methods are left with little available memory. To provide high reliability, those methods either keep two copies of each checkpoint, so that a failure during the update of an old checkpoint can be tolerated, or trade performance for space by flushing in-memory checkpoints to disk. In this paper, we propose a novel in-memory checkpoint method, called self-checkpoint, which not only achieves the same reliability as previous in-memory checkpoint methods but also increases the memory available to applications by almost 50 percent. To validate our method, we apply self-checkpoint to an important problem: fault-tolerant High-Performance Linpack (HPL). We implement a scalable and fault-tolerant HPL based on this new method, called SKT-HPL, and validate it on two large-scale systems. Experimental results with 24,576 processes show that SKT-HPL achieves over 95 percent of the performance of the original HPL. Compared to the state-of-the-art in-memory checkpoint method, it improves the available memory size by 47 percent and the performance by 5 percent.
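The memory figures in the abstract can be made concrete with a back-of-the-envelope calculation. The sketch below is not taken from the paper; it assumes a deliberately simplified model in which every in-memory checkpoint copy is as large as the application data and any parity or encoding overhead is ignored. The function name `available_fraction` is hypothetical. Under that assumption, keeping two checkpoint copies leaves about one third of memory for the application, while needing only one copy's worth of extra space leaves about one half, which is a 50 percent increase and in the same range as the "almost 50 percent" and "47 percent" figures quoted above.

```python
from fractions import Fraction


def available_fraction(checkpoint_copies: int) -> Fraction:
    """Fraction of memory left for application data.

    Assumed model (not from the paper): total memory is split between the
    application data and `checkpoint_copies` copies of equal size.
    """
    return Fraction(1, 1 + checkpoint_copies)


# Two full-size in-memory copies, as the abstract describes for earlier methods.
double_copy = available_fraction(2)

# Only one copy's worth of extra space (an idealized assumption).
single_copy = available_fraction(1)

# Relative increase in memory available to the application.
relative_gain = single_copy / double_copy - 1

print(double_copy, single_copy, float(relative_gain))
```

Real systems also spend memory on encoding data and runtime buffers, so measured gains would be expected to fall somewhat below this idealized bound.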
Author Xiongchao Tang
Wenguang Chen
Weimin Zheng
Bowen Yu
Keqin Li
Jidong Zhai
Author_xml – sequence: 1
  givenname: Xiongchao
  orcidid: 0000-0002-1692-3964
  surname: Tang
  fullname: Tang, Xiongchao
– sequence: 2
  givenname: Jidong
  surname: Zhai
  fullname: Zhai, Jidong
– sequence: 3
  givenname: Bowen
  surname: Yu
  fullname: Yu, Bowen
– sequence: 4
  givenname: Wenguang
  surname: Chen
  fullname: Chen, Wenguang
– sequence: 5
  givenname: Weimin
  surname: Zheng
  fullname: Zheng, Weimin
– sequence: 6
  givenname: Keqin
  orcidid: 0000-0001-5224-4048
  surname: Li
  fullname: Li, Keqin
CODEN ITDSEO
CitedBy_id crossref_primary_10_1016_j_cosrev_2024_100660
crossref_primary_10_1002_cpe_8081
crossref_primary_10_1109_ACCESS_2019_2903588
crossref_primary_10_1109_TPDS_2020_3015615
crossref_primary_10_1177_10943420211055188
crossref_primary_10_1109_TPDS_2019_2937492
ContentType Journal Article
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2018
Copyright_xml – notice: Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2018
DBID 97E
RIA
RIE
AAYXX
CITATION
7SC
7SP
8FD
JQ2
L7M
L~C
L~D
DOI 10.1109/TPDS.2017.2781257
DatabaseName IEEE All-Society Periodicals Package (ASPP) 2005–Present
IEEE All-Society Periodicals Package (ASPP) 1998–Present
IEEE Electronic Library (IEL)
CrossRef
Computer and Information Systems Abstracts
Electronics & Communications Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle CrossRef
Technology Research Database
Computer and Information Systems Abstracts – Academic
Electronics & Communications Abstracts
ProQuest Computer Science Collection
Computer and Information Systems Abstracts
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts Professional
DatabaseTitleList
Technology Research Database
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE/IET Electronic Library
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
EISSN 1558-2183
EndPage 771
ExternalDocumentID 10_1109_TPDS_2017_2781257
8170311
Genre orig-research
GrantInformation_xml – fundername: National Key Research and Development Program of China
  grantid: 2017YFB1003103
– fundername: NSFC
  grantid: 61232008; 61722208
– fundername: Tsinghua University Initiative Scientific Research Program
– fundername: Microsoft Research Asia Collaborative Research Program
  grantid: FY16-RES-THEME-095
ID FETCH-LOGICAL-c293t-b32f9b6bcd6a28f7044db3684b64158bb0093e71c69c73914e245ae0e1c9a9b33
IEDL.DBID RIE
ISSN 1045-9219
IngestDate Sun Sep 07 03:42:35 EDT 2025
Thu Apr 24 22:57:30 EDT 2025
Tue Jul 01 03:58:37 EDT 2025
Wed Aug 27 02:52:20 EDT 2025
IsPeerReviewed true
IsScholarly true
Issue 4
Language English
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-c293t-b32f9b6bcd6a28f7044db3684b64158bb0093e71c69c73914e245ae0e1c9a9b33
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ORCID 0000-0002-1692-3964
0000-0001-5224-4048
PQID 2174537431
PQPubID 85437
PageCount 14
ParticipantIDs crossref_primary_10_1109_TPDS_2017_2781257
proquest_journals_2174537431
ieee_primary_8170311
crossref_citationtrail_10_1109_TPDS_2017_2781257
ProviderPackageCode CITATION
AAYXX
PublicationCentury 2000
PublicationDate 2018-04-01
PublicationDateYYYYMMDD 2018-04-01
PublicationDate_xml – month: 04
  year: 2018
  text: 2018-04-01
  day: 01
PublicationDecade 2010
PublicationPlace New York
PublicationPlace_xml – name: New York
PublicationTitle IEEE transactions on parallel and distributed systems
PublicationTitleAbbrev TPDS
PublicationYear 2018
Publisher IEEE
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Publisher_xml – name: IEEE
– name: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
SSID ssj0014504
Score 2.2718735
Snippet Fault tolerance is increasingly important in high-performance computing due to the substantial growth of system scale and decreasing system reliability....
SourceID proquest
crossref
ieee
SourceType Aggregation Database
Enrichment Source
Index Database
Publisher
StartPage 758
SubjectTerms Computer memory
Encoding
Fault tolerance
Fault tolerant systems
fault-tolerant HPL
in-memory checkpoint
Large-scale systems
memory consumption
Memory management
Methods
Random access memory
Servers
State of the art
System reliability
Title An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL
URI https://ieeexplore.ieee.org/document/8170311
https://www.proquest.com/docview/2174537431
Volume 29