An Efficient In-Memory Checkpoint Method and its Practice on Fault-Tolerant HPL
Fault tolerance is increasingly important in high-performance computing due to the substantial growth of system scale and decreasing system reliability. In-memory/diskless checkpoint has gained extensive attention as a solution to avoid the IO bottleneck of traditional disk-based checkpoint methods....
Published in | IEEE Transactions on Parallel and Distributed Systems, Vol. 29, no. 4, pp. 758-771 |
---|---|
Main Authors | Xiongchao Tang, Jidong Zhai, Bowen Yu, Wenguang Chen, Weimin Zheng, Keqin Li |
Format | Journal Article |
Language | English |
Published | New York: IEEE, 01.04.2018; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
ISSN | 1045-9219; EISSN 1558-2183 |
DOI | 10.1109/TPDS.2017.2781257 |
Abstract | Fault tolerance is increasingly important in high-performance computing due to the substantial growth of system scale and decreasing system reliability. In-memory (diskless) checkpointing has gained extensive attention as a way to avoid the I/O bottleneck of traditional disk-based checkpoint methods. However, applications that use previous in-memory checkpoint methods are left with little available memory space. To provide high reliability, previous in-memory checkpoint methods either need to keep two copies of checkpoints to tolerate failures while updating old checkpoints, or trade performance for space by flushing in-memory checkpoints to disk. In this paper, we propose a novel in-memory checkpoint method, called self-checkpoint, which not only achieves the same reliability as previous in-memory checkpoint methods, but also increases the memory space available to applications by almost 50 percent. To validate our method, we apply the self-checkpoint method to an important problem: High-Performance Linpack (HPL) with fault tolerance. We implement a scalable and fault-tolerant HPL based on this new method, called SKT-HPL, and validate it on two large-scale systems. Experimental results with 24,576 processes show that SKT-HPL achieves over 95 percent of the performance of the original HPL. Compared to the state-of-the-art in-memory checkpoint method, it improves the available memory size by 47 percent and the performance by 5 percent. |
---|---|
Author | Xiongchao Tang Wenguang Chen Weimin Zheng Bowen Yu Keqin Li Jidong Zhai |
Author_xml | – sequence: 1 givenname: Xiongchao orcidid: 0000-0002-1692-3964 surname: Tang fullname: Tang, Xiongchao – sequence: 2 givenname: Jidong surname: Zhai fullname: Zhai, Jidong – sequence: 3 givenname: Bowen surname: Yu fullname: Yu, Bowen – sequence: 4 givenname: Wenguang surname: Chen fullname: Chen, Wenguang – sequence: 5 givenname: Weimin surname: Zheng fullname: Zheng, Weimin – sequence: 6 givenname: Keqin orcidid: 0000-0001-5224-4048 surname: Li fullname: Li, Keqin |
CODEN | ITDSEO |
ContentType | Journal Article |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2018 |
Discipline | Engineering Computer Science |
Genre | orig-research |
GrantInformation | National Key Research and Development Program of China (grant 2017YFB1003103); NSFC (grants 61232008, 61722208); Tsinghua University Initiative Scientific Research Program; Microsoft Research Asia Collaborative Research Program (grant FY16-RES-THEME-095) |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 4 |
Language | English |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html |
ORCID | 0000-0002-1692-3964 0000-0001-5224-4048 |
PublicationDate | 2018-04-01 |
PublicationTitle | IEEE transactions on parallel and distributed systems |
PublicationTitleAbbrev | TPDS |
PublicationYear | 2018 |
Publisher | IEEE; The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
SubjectTerms | Computer memory; Encoding; Fault tolerance; Fault tolerant systems; fault-tolerant HPL; in-memory checkpoint; Large-scale systems; memory consumption; Memory management; Methods; Random access memory; Servers; State of the art; System reliability |