Node failure resiliency for Uintah without checkpointing
Summary The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many‐core architectures if component failure rates remain unchanged. This potential increase in failure frequency coupled with I/O challenges at exascale may prove problematic for current...
Saved in:
Published in | Concurrency and computation Vol. 31; no. 20 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published |
Hoboken
Wiley Subscription Services, Inc
25.10.2019
Wiley |
Subjects | |
Online Access | Get full text |
ISSN | 1532-0626 1532-0634 |
DOI | 10.1002/cpe.5340 |
Cover
Loading…
Summary: | Summary
The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many‐core architectures if component failure rates remain unchanged. This potential increase in failure frequency coupled with I/O challenges at exascale may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm‐based fault tolerance (ABFT) using adaptive mesh refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework: both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault‐tolerant message passing interface (MPI) implementation to recover from runtime node failures, and (2) high‐order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a “limited essentially nonoscillatory” (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using a fault‐tolerant MPI‐user‐level failure mitigation to recover from runtime failure and LENO to recover data on patches belonging to failed ranks, while the simulation was continued to the end. Results show that this ABFT approach is up to 10× faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods. |
---|---|
Bibliography: | Present Address Damodar Sahasrabudhe, Scientific Computing and Imaging Institute, University of Utah, 72 Central Campus Dr, Salt Lake City, UT 84112 ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 National Science Foundation (NSF) NA0002375; 1337145 USDOE National Nuclear Security Administration (NNSA) |
ISSN: | 1532-0626 1532-0634 |
DOI: | 10.1002/cpe.5340 |