Correlated set coordination in fault tolerant message logging protocols for many-core clusters

Bibliographic Details
Published in	Concurrency and computation Vol. 25; no. 4; pp. 572 - 585
Main Authors	Bouteiller, Aurelien; Herault, Thomas; Bosilca, George; Dongarra, Jack J.
Format	Journal Article
Language	English
Published	Blackwell Publishing Ltd 01.02.2013
Subjects	checkpoint/restart; Concurrency; Correlation; Fault tolerance; Logging; Messages; multicore clusters; Payloads

Summary:	With our current expectation for the exascale systems, composed of hundreds of thousands of many‐core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes. Copyright © 2012 John Wiley & Sons, Ltd.
Bibliography:	ArticleID:CPE2859; ark:/67375/WNG-PM397PK9-0; istex:1F7C355B9776DE131D7950A7BB9E6126336AF643
ISSN:	1532-0626 1532-0634
DOI:	10.1002/cpe.2859
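
The idea stated in the summary can be illustrated with a small sketch. It is not the authors' implementation. It assumes that a correlated set is simply the group of processes sharing a node, that payloads are kept in a sender-side log, and that delivery events are recorded at the receiver. The names `Process`, `same_correlated_set`, and `send` are hypothetical and chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Process:
    rank: int
    node: int  # assumption: processes on the same node form one correlated set
    payload_log: list = field(default_factory=list)  # sender-side payload copies
    event_log: list = field(default_factory=list)    # delivery events seen by this process


def same_correlated_set(a: Process, b: Process) -> bool:
    """Assumed correlation rule: two processes fail together if they share a node."""
    return a.node == b.node


def send(sender: Process, receiver: Process, seq: int, payload: bytes) -> bytes:
    """Deliver a message, logging only what the summary says is still needed."""
    # Event logging is kept for every message (the protocol stays an
    # event logging protocol according to the summary).
    receiver.event_log.append((sender.rank, seq))
    # Payload logging is skipped inside a correlated set, since those
    # processes are recovered together through coordination.
    if not same_correlated_set(sender, receiver):
        sender.payload_log.append((receiver.rank, seq, payload))
    return payload


if __name__ == "__main__":
    p0, p1, p2 = Process(0, node=0), Process(1, node=0), Process(2, node=1)
    send(p0, p1, 0, b"intra-node")  # same node: event logged, payload not logged
    send(p0, p2, 1, b"inter-node")  # different nodes: event and payload logged
    print(len(p0.payload_log), "payload(s) logged by rank 0")
```

In this sketch only the message that crosses a node boundary costs a payload copy, which is the saving the summary attributes to coordinating correlated processes.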