Exploring void search for fault detection on extreme scale systems

Bibliographic Details
Published in	2014 IEEE International Conference on Cluster Computing (CLUSTER) pp. 1 - 9
Main Authors	Berrocal, Eduardo; Yu, Li; Wallace, Sean; Papka, Michael E.; Lan, Zhiling
Format	Conference Proceeding
Language	English
Published	IEEE 01.09.2014
Subjects	Blue Gene/Q; Computers; Environmental Data; Fault Detection; Lead; Reliability; Void Search
Summary:	Mean Time Between Failures (MTBF), now calculated in days or hours, is expected to drop to minutes on exascale machines. The advancement of resilience technologies greatly depends on a deeper understanding of faults arising from hardware and software components. This understanding has the potential to help us build better fault tolerance technologies. For instance, it has been shown that combining checkpointing and failure prediction leads to longer checkpoint intervals, which in turn leads to fewer total checkpoints. In this paper we present a new approach for fault detection based on the Void Search (VS) algorithm. VS is used primarily in astrophysics for finding areas of space that have a very low density of galaxies. We evaluate our algorithm using real environmental logs from the Mira Blue Gene/Q supercomputer at Argonne National Laboratory. Our experiments show that our approach can detect almost all faults (i.e., sensitivity close to 1) with a low false positive rate (i.e., specificity values above 0.7). We also compare our algorithm with a number of existing detection algorithms, and find that ours outperforms all of them.
ISSN:	1552-5244 2168-9253
DOI:	10.1109/CLUSTER.2014.6968757
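
The summary reports results as sensitivity and specificity. As a reading aid, the sketch below shows how these two metrics are conventionally computed from labeled detections. It is not the paper's evaluation code: the function name `sensitivity_specificity` and the sample labels are hypothetical and chosen only for illustration.

```python
import math


def sensitivity_specificity(actual, predicted):
    """Return (sensitivity, specificity) for two equal-length boolean sequences.

    actual[i]    -- True if sample i really was a fault
    predicted[i] -- True if the detector flagged sample i as a fault
    """
    if len(actual) != len(predicted):
        raise ValueError("actual and predicted must have the same length")

    tp = fn = tn = fp = 0
    for a, p in zip(actual, predicted):
        if a and p:
            tp += 1          # fault, correctly flagged
        elif a and not p:
            fn += 1          # fault, missed
        elif not a and p:
            fp += 1          # normal sample, wrongly flagged
        else:
            tn += 1          # normal sample, correctly ignored

    # Sensitivity: fraction of real faults that were detected.
    sensitivity = tp / (tp + fn) if (tp + fn) else math.nan
    # Specificity: fraction of normal samples that were not flagged.
    specificity = tn / (tn + fp) if (tn + fp) else math.nan
    return sensitivity, specificity


# Hypothetical labels, for illustration only (not data from the paper).
actual = [True, True, False, False, False]
predicted = [True, True, True, False, False]
sens, spec = sensitivity_specificity(actual, predicted)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
```

Under these definitions the false positive rate equals 1 minus specificity, so a specificity above 0.7 corresponds to a false positive rate below 0.3.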