Doomsday: Predicting Which Node Will Fail When on Supercomputers

Bibliographic Details
Published in SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 108-121
Main Authors Das, Anwesha; Mueller, Frank; Hargrove, Paul; Roman, Eric; Baden, Scott
Format Conference Proceeding
Language English
Published IEEE 01.11.2018

Abstract Predicting which node will fail and how soon remains a challenge for HPC resilience, yet may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing scalability up to exascale systems but even for contemporary supercomputer architectures does it require substantial efforts to distill anomalous events from noisy raw logs. To this end, we propose a novel phrase extraction mechanism called TBP (time-based phrases) to pin-point node failures, which is unprecedented. Our study, based on real system data and statistical machine learning, demonstrates the feasibility to predict which specific node will fail in Cray systems. TBP achieves no less than 83% recall rates with lead times as high as 2 minutes. This opens up the door for enhancing prediction lead times for supercomputing systems in general, thereby facilitating efficient usage of both computing capacity and power in large scale production systems.
DOI 10.1109/SC.2018.00012
EISBN 1538683849
9781538683842
SubjectTerms Blades
Correlation
Failure Analysis
Hardware
HPC
Machine Learning
Monitoring
Noise measurement
Resilience
Supercomputers
URI https://ieeexplore.ieee.org/document/8665792