Doomsday: Predicting Which Node Will Fail When on Supercomputers
Published in | SC18: International Conference for High Performance Computing, Networking, Storage and Analysis pp. 108 - 121 |
---|---|
Main Authors | Das, Anwesha; Mueller, Frank; Hargrove, Paul; Roman, Eric; Baden, Scott |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.11.2018 |
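Subjects | Blades; Correlation; Failure Analysis; Hardware; HPC; Machine Learning; Monitoring; Noise measurement; Resilience; Supercomputers |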
Abstract | Predicting which node will fail and how soon remains a challenge for HPC resilience, yet may pave the way to exploiting proactive remedies before jobs fail. Not only for increasing scalability up to exascale systems but even for contemporary supercomputer architectures does it require substantial efforts to distill anomalous events from noisy raw logs. To this end, we propose a novel phrase extraction mechanism called TBP (time-based phrases) to pin-point node failures, which is unprecedented. Our study, based on real system data and statistical machine learning, demonstrates the feasibility to predict which specific node will fail in Cray systems. TBP achieves no less than 83% recall rates with lead times as high as 2 minutes. This opens up the door for enhancing prediction lead times for supercomputing systems in general, thereby facilitating efficient usage of both computing capacity and power in large scale production systems. |
---|---|
Author | Hargrove, Paul; Das, Anwesha; Mueller, Frank; Roman, Eric; Baden, Scott |
Author_xml | – sequence: 1 givenname: Anwesha surname: Das fullname: Das, Anwesha – sequence: 2 givenname: Frank surname: Mueller fullname: Mueller, Frank – sequence: 3 givenname: Paul surname: Hargrove fullname: Hargrove, Paul – sequence: 4 givenname: Eric surname: Roman fullname: Roman, Eric – sequence: 5 givenname: Scott surname: Baden fullname: Baden, Scott |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1109/SC.2018.00012 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings; IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume; IEEE Xplore All Conference Proceedings; IEEE Electronic Library (IEL); IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library (IEL) url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
EISBN | 1538683849 9781538683842 |
EndPage | 121 |
ExternalDocumentID | 8665792 |
Genre | orig-research |
GroupedDBID | 6IE 6IL CBEJK RIE RIL |
ID | FETCH-LOGICAL-a158t-9cf021560c508c80a4676dcb5a108378a9ea319865fa2dbb3108b93b167ab59a3 |
IEDL.DBID | RIE |
IngestDate | Thu Jun 29 18:39:01 EDT 2023 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-a158t-9cf021560c508c80a4676dcb5a108378a9ea319865fa2dbb3108b93b167ab59a3 |
PageCount | 14 |
ParticipantIDs | ieee_primary_8665792 |
PublicationCentury | 2000 |
PublicationDate | 2018-Nov |
PublicationDateYYYYMMDD | 2018-11-01 |
PublicationDate_xml | – month: 11 year: 2018 text: 2018-Nov |
PublicationDecade | 2010 |
PublicationTitle | SC18: International Conference for High Performance Computing, Networking, Storage and Analysis |
PublicationTitleAbbrev | SC |
PublicationYear | 2018 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
Score | 1.8993813 |
SourceID | ieee |
SourceType | Publisher |
StartPage | 108 |
SubjectTerms | Blades; Correlation; Failure Analysis; Hardware; HPC; Machine Learning; Monitoring; Noise measurement; Resilience; Supercomputers |
Title | Doomsday: Predicting Which Node Will Fail When on Supercomputers |
URI | https://ieeexplore.ieee.org/document/8665792 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
link.rule.ids | 310,311,786,790,795,796,802,27956,55107 |
linkProvider | IEEE |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=SC18%3A+International+Conference+for+High+Performance+Computing%2C+Networking%2C+Storage+and+Analysis&rft.atitle=Doomsday%3A+Predicting+Which+Node+Will+Fail+When+on+Supercomputers&rft.au=Das%2C+Anwesha&rft.au=Mueller%2C+Frank&rft.au=Hargrove%2C+Paul&rft.au=Roman%2C+Eric&rft.date=2018-11-01&rft.pub=IEEE&rft.spage=108&rft.epage=121&rft_id=info:doi/10.1109%2FSC.2018.00012&rft.externalDocID=8665792 |