A Visual Comparison of Silent Error Propagation

High-performance computing (HPC) systems play a critical role in facilitating scientific discoveries. Their scale and complexity (e.g., the number of computational units and software stack) continue to grow as new systems are expected to process increasingly more data and reduce computing time. Howe...

Full description

Saved in:
Bibliographic Details
Published inIEEE transactions on visualization and computer graphics Vol. 30; no. 7; pp. 3268 - 3282
Main Authors Li, Zhimin, Menon, Harshitha, Mohror, Kathryn, Liu, Shusen, Guo, Luanzheng, Bremer, Peer-Timo, Pascucci, Valerio
Format Journal Article
LanguageEnglish
Published United States IEEE 01.07.2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:High-performance computing (HPC) systems play a critical role in facilitating scientific discoveries. Their scale and complexity (e.g., the number of computational units and software stack) continue to grow as new systems are expected to process increasingly more data and reduce computing time. However, with more processing elements, the probability that these systems will experience a random bit-flip error that corrupts a program's output also increases, which is often recognized as silent data corruption. Analyzing the resiliency of HPC applications in extreme-scale computing to silent data corruption is crucial but difficult. An HPC application often contains a large number of computation units that need to be tested, and error propagation caused by error corruption is complex and difficult to interpret. To accommodate this challenge, we propose an interactive visualization system that helps HPC researchers understand the resiliency of HPC applications and compare their error propagation. Our system models an application's error propagation to study a program's resiliency by constructing and visualizing its fault tolerance boundary. Coordinating with multiple interactive designs, our system enables domain experts to efficiently explore the complicated spatial and temporal correlation between error propagations. At the end, the system integrated a nonmonotonic error propagation analysis with an adjustable graph propagation visualization to help domain experts examine the details of error propagation and answer such questions as why an error is mitigated or amplified by program execution.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
content type line 23
LLNL-JRNL-843184
USDOE Office of Science (SC), Advanced Scientific Computing Research (ASCR)
AC52-07NA27344; SC0014098
USDOE National Nuclear Security Administration (NNSA)
ISSN:1077-2626
1941-0506
1941-0506
DOI:10.1109/TVCG.2022.3230636