An evaluation of User-Level Failure Mitigation support in MPI

Bibliographic Details
Published in: Computing Vol. 95; no. 12; pp. 1171-1184
Main Authors: Bland, Wesley; Bouteiller, Aurelien; Herault, Thomas; Hursey, Joshua; Bosilca, George; Dongarra, Jack J.
Format: Journal Article
Language: English
Published: Vienna: Springer Vienna, 01.12.2013
Springer Nature B.V
Summary: As the scale of computing platforms becomes increasingly extreme, the requirements for application fault tolerance are increasing as well. Techniques to address this problem by improving the resilience of algorithms have been developed, but they currently receive no support from the programming model, and without such support, they are bound to fail. This paper discusses the failure-free overhead and recovery impact of the user-level failure mitigation proposal presented in the MPI Forum. Experiments demonstrate that fault-aware MPI has little or no impact on performance for a range of applications, and produces satisfactory recovery times when there are failures.
ISSN: 0010-485X; 1436-5057
DOI: 10.1007/s00607-013-0331-3