Doomsday: Predicting Which Node Will Fail When on Supercomputers

Bibliographic Details
Published in	SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 108-121
Main Authors	Das, Anwesha; Mueller, Frank; Hargrove, Paul; Roman, Eric; Baden, Scott
Format	Conference Proceeding
Language	English
Published	IEEE 01.11.2018
Subjects	Blades; Correlation; Failure Analysis; Hardware; HPC; Machine Learning; Monitoring; Noise measurement; Resilience; Supercomputers

Summary:	Predicting which node will fail, and how soon, remains a challenge for HPC resilience, yet it may pave the way to applying proactive remedies before jobs fail. Distilling anomalous events from noisy raw logs requires substantial effort, not only at the scale of future exascale systems but even on contemporary supercomputer architectures. To this end, we propose a novel phrase extraction mechanism called TBP (time-based phrases) to pinpoint node failures, a capability that is unprecedented. Our study, based on real system data and statistical machine learning, demonstrates that it is feasible to predict which specific node will fail in Cray systems. TBP achieves recall rates of no less than 83% with lead times as high as 2 minutes. This opens the door to enhancing prediction lead times for supercomputing systems in general, thereby facilitating efficient use of both computing capacity and power in large-scale production systems.
DOI:	10.1109/SC.2018.00012