Data-Driven Application-Oriented Reliability Model of a High-Performance Computing System


Bibliographic Details
Published in: IEEE Transactions on Reliability, Vol. 71, no. 2, pp. 603-615
Main Authors: Jafary, Bentolhoda; Jha, Saurabh; Fiondella, Lance; Iyer, Ravishankar K.
Format: Journal Article
Language: English
Published: New York, IEEE, 01.06.2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: Reliability analysis and performance evaluation are complementary methods to quantify nonfunctional aspects of a system. However, a range of factors such as concurrency and heterogeneity quickly exacerbate the state-space explosion problem when attempting detailed system-level modeling and simulation of high-performance computing (HPC) systems. To overcome these impediments to modeling and analysis, this article develops a hierarchical model of an application that implements checkpointing running in an HPC environment subject to application, network, and system-wide outages. The modeling approach ensures that the number of states is linear in the number of checkpoints and possesses a low constant factor for the number of recovery states most relevant to the external influences contributing to degraded application performance. We illustrate the types of analysis enabled by the model through a series of examples with parameters determined empirically from data logs of the Blue Waters supercomputer located at the University of Illinois at Urbana-Champaign. A comprehensive comparative analysis of the model parameters indicates that lowering the failure rate of network nodes would most significantly reduce application downtime. We also discuss how the modeling approach can be used to objectively assess both current and hypothetical future systems to identify competitive designs and enhancements.
ISSN: 0018-9529, 1558-1721
DOI: 10.1109/TR.2021.3085582