Correlated set coordination in fault tolerant message logging protocols for many-core clusters


Bibliographic Details
Published in Concurrency and computation Vol. 25; no. 4; pp. 572-585
Main Authors Bouteiller, Aurelien, Herault, Thomas, Bosilca, George, Dongarra, Jack J.
Format Journal Article
Language English
Published Blackwell Publishing Ltd 01.02.2013
Summary: With our current expectation for the exascale systems, composed of hundreds of thousands of many‐core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes. Copyright © 2012 John Wiley & Sons, Ltd.
Bibliography:ArticleID:CPE2859
ark:/67375/WNG-PM397PK9-0
istex:1F7C355B9776DE131D7950A7BB9E6126336AF643
ISSN: 1532-0626, 1532-0634
DOI: 10.1002/cpe.2859