Node failure resiliency for Uintah without checkpointing

Bibliographic Details
Published in Concurrency and computation Vol. 31; no. 20
Main Authors Sahasrabudhe, Damodar; Berzins, Martin; Schmidt, John
Format Journal Article
Language English
Published Hoboken Wiley Subscription Services, Inc 25.10.2019
Wiley
ISSN 1532-0626
1532-0634
DOI 10.1002/cpe.5340

Summary: The frequency of failures in upcoming exascale supercomputers may well be greater than at present due to many‐core architectures if component failure rates remain unchanged. This potential increase in failure frequency, coupled with I/O challenges at exascale, may prove problematic for current resiliency approaches such as checkpoint restarting, although the use of fast intermediate memory may help. Algorithm‐based fault tolerance (ABFT) using adaptive mesh refinement (AMR) is one resiliency approach used to address these challenges. For adaptive mesh codes, a coarse mesh version of the solution may be used to restore the fine mesh solution. This paper addresses the implementation of the ABFT approach within the Uintah software framework, both at a software level within Uintah and in the data reconstruction method used for the recovery of lost data. This reconstruction method has two problems: inaccuracies introduced during the reconstruction propagate forward in time, and the physical consistency of variables, such as positivity or boundedness, may be violated during interpolation. These challenges can be addressed by the combination of two techniques: (1) a fault‐tolerant message passing interface (MPI) implementation to recover from runtime node failures, and (2) high‐order interpolation schemes to preserve the physical solution and reconstruct lost data. The approach considered here uses a “limited essentially nonoscillatory” (LENO) scheme along with AMR to rebuild the lost data without checkpointing using Uintah. Experiments were carried out using the fault‐tolerant MPI extension User‐Level Failure Mitigation (ULFM) to recover from runtime failures and LENO to recover data on patches belonging to failed ranks, while the simulation continued to the end. Results show that this ABFT approach is up to 10× faster than the traditional checkpointing method. The new interpolation approach is more accurate than linear interpolation and not subject to the overshoots found in other interpolation methods.
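The coarse-to-fine recovery idea in the summary can be illustrated with a minimal 1D sketch. This is not the paper's LENO scheme and not Uintah code: it substitutes a much simpler minmod-limited linear reconstruction, chosen only because it shows how a limiter keeps rebuilt fine-mesh values within the range of the neighbouring coarse values. The function names `minmod` and `refine_limited`, the refinement ratio of 2, and the flat treatment of boundary cells are all assumptions made for illustration.

```python
def minmod(a, b):
    """Return the smaller-magnitude argument if both share a sign, else 0."""
    if a * b <= 0.0:
        return 0.0
    return a if abs(a) < abs(b) else b


def refine_limited(coarse):
    """Rebuild fine cells (assumed refinement ratio 2) from 1D coarse cell averages.

    Hypothetical stand-in for a limited reconstruction; not the LENO scheme.
    """
    n = len(coarse)
    fine = []
    for i, u in enumerate(coarse):
        # One-sided differences; set to zero at the domain ends so that
        # boundary cells are reconstructed as flat (an assumption).
        left = u - coarse[i - 1] if i > 0 else 0.0
        right = coarse[i + 1] - u if i < n - 1 else 0.0
        slope = minmod(left, right)  # limited slope per coarse cell width
        # Fine cell centres sit a quarter of a coarse cell either side of
        # the coarse centre, so the pair still averages to u.
        fine.append(u - 0.25 * slope)
        fine.append(u + 0.25 * slope)
    return fine


if __name__ == "__main__":
    # A step profile: the limiter zeroes the slope next to the jump,
    # so the rebuilt values never leave the range [0, 1].
    print(refine_limited([0.0, 0.0, 1.0, 1.0]))
    # A smooth ramp: interior cells get a non-zero limited slope.
    print(refine_limited([1.0, 2.0, 4.0]))
```

In this sketch each pair of rebuilt fine values averages back to its coarse value, and no rebuilt value exceeds the minimum or maximum of the coarse data, which is the positivity and boundedness property the summary refers to. A higher-order limited scheme such as the one the paper describes would aim for better accuracy in smooth regions while keeping that property.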
Bibliography: Present Address
Damodar Sahasrabudhe, Scientific Computing and Imaging Institute, University of Utah, 72 Central Campus Dr, Salt Lake City, UT 84112
National Science Foundation (NSF)
NA0002375; 1337145
USDOE National Nuclear Security Administration (NNSA)