Doomsday: Predicting Which Node Will Fail When on Supercomputers

Bibliographic Details
Published in	SC18: International Conference for High Performance Computing, Networking, Storage and Analysis pp. 108 - 121
Main Authors	Das, Anwesha; Mueller, Frank; Hargrove, Paul; Roman, Eric; Baden, Scott
Format	Conference Proceeding
Language	English
Published	IEEE 01.11.2018
Subjects	Blades; Correlation; Failure Analysis; Hardware; HPC; Machine Learning; Monitoring; Noise measurement; Resilience; Supercomputers

Abstract	Predicting which node will fail and how soon remains a challenge for HPC resilience, yet may pave the way to exploiting proactive remedies before jobs fail. Distilling anomalous events from noisy raw logs requires substantial effort, not only for scaling up to exascale systems but even on contemporary supercomputer architectures. To this end, we propose a novel phrase extraction mechanism called TBP (time-based phrases) to pinpoint node failures, which is unprecedented. Our study, based on real system data and statistical machine learning, demonstrates the feasibility of predicting which specific node will fail in Cray systems. TBP achieves recall rates of no less than 83% with lead times as high as 2 minutes. This opens the door to enhancing prediction lead times for supercomputing systems in general, thereby facilitating efficient usage of both computing capacity and power in large-scale production systems.
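The abstract reports two metrics, recall and lead time, without defining them in this record. The sketch below shows one common way such figures could be computed from per-node alarms and observed failures. It is an illustration only, not the TBP method or the authors' evaluation code: the names `Alarm`, `Failure`, `recall_and_lead_times`, and the `horizon_s` matching window are all hypothetical, and the sample data is invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alarm:
    node: str      # node the predictor flagged (hypothetical identifier)
    time_s: float  # time the alarm was raised, in seconds


@dataclass(frozen=True)
class Failure:
    node: str      # node that actually failed
    time_s: float  # time of the failure, in seconds


def recall_and_lead_times(alarms, failures, horizon_s=600.0):
    """Return (recall, lead_times).

    A failure counts as predicted if an alarm for the same node was raised
    strictly before it and no more than horizon_s seconds earlier (an assumed
    matching rule). Its lead time is the gap to the earliest such alarm.
    """
    lead_times = []
    for failure in failures:
        gaps = [
            failure.time_s - alarm.time_s
            for alarm in alarms
            if alarm.node == failure.node
            and 0.0 < failure.time_s - alarm.time_s <= horizon_s
        ]
        if gaps:
            lead_times.append(max(gaps))  # earliest alarm gives the longest lead
    recall = len(lead_times) / len(failures) if failures else 0.0
    return recall, lead_times


# Invented sample data: three failures, only one preceded by a timely alarm.
alarms = [Alarm("node-a", 100.0), Alarm("node-a", 160.0), Alarm("node-b", 50.0)]
failures = [Failure("node-a", 220.0), Failure("node-b", 2000.0), Failure("node-c", 300.0)]
recall, lead_times = recall_and_lead_times(alarms, failures)
print(f"recall={recall:.2f}, lead times (s)={lead_times}")
```

With this rule, the alarm on node-b is too early to count and node-c was never flagged, so only the node-a failure is matched. A different matching window or a per-alarm rather than per-failure view would change the numbers, which is why the definition should be taken from the full paper rather than from this sketch.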
Author	Hargrove, Paul; Das, Anwesha; Mueller, Frank; Roman, Eric; Baden, Scott
Author_xml	– sequence: 1 givenname: Anwesha surname: Das fullname: Das, Anwesha – sequence: 2 givenname: Frank surname: Mueller fullname: Mueller, Frank – sequence: 3 givenname: Paul surname: Hargrove fullname: Hargrove, Paul – sequence: 4 givenname: Eric surname: Roman fullname: Roman, Eric – sequence: 5 givenname: Scott surname: Baden fullname: Baden, Scott
CODEN	IEEPAD
ContentType	Conference Proceeding
DBID	6IE 6IL CBEJK RIE RIL
DOI	10.1109/SC.2018.00012
DatabaseName	IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml	– sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher
DeliveryMethod	fulltext_linktorsrc
EISBN	1538683849 9781538683842
EndPage	121
ExternalDocumentID	8665792
Genre	orig-research
GroupedDBID	6IE 6IL CBEJK RIE RIL
ID	FETCH-LOGICAL-a158t-9cf021560c508c80a4676dcb5a108378a9ea319865fa2dbb3108b93b167ab59a3
IEDL.DBID	RIE
IngestDate	Thu Jun 29 18:39:01 EDT 2023
IsPeerReviewed	false
IsScholarly	false
Language	English
LinkModel	DirectLink
MergedId	FETCHMERGED-LOGICAL-a158t-9cf021560c508c80a4676dcb5a108378a9ea319865fa2dbb3108b93b167ab59a3
PageCount	14
ParticipantIDs	ieee_primary_8665792
PublicationCentury	2000
PublicationDate	2018-Nov
PublicationDateYYYYMMDD	2018-11-01
PublicationDate_xml	– month: 11 year: 2018 text: 2018-Nov
PublicationDecade	2010
PublicationTitle	SC18: International Conference for High Performance Computing, Networking, Storage and Analysis
PublicationTitleAbbrev	SC
PublicationYear	2018
Publisher	IEEE
Publisher_xml	– name: IEEE
Score	1.8993813
SourceID	ieee
SourceType	Publisher
StartPage	108
SubjectTerms	Blades; Correlation; Failure Analysis; Hardware; HPC; Machine Learning; Monitoring; Noise measurement; Resilience; Supercomputers
Title	Doomsday: Predicting Which Node Will Fail When on Supercomputers
URI	https://ieeexplore.ieee.org/document/8665792
hasFullText	1
inHoldings	1