Correlated set coordination in fault tolerant message logging protocols for many-core clusters
Published in | Concurrency and computation Vol. 25; no. 4; pp. 572 - 585 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | Blackwell Publishing Ltd, 01.02.2013 |
Summary: | With our current expectation for the exascale systems, composed of hundreds of thousands of many‐core nodes, the mean time between failures will become small, even under the most optimistic assumptions. One of the most scalable checkpoint restart techniques, the message logging approach, is the most challenged when the number of cores per node increases because of the high overhead of saving the message payload. Fortunately, for two processes on the same node, the failure probability is correlated, meaning that coordinated recovery is free. In this paper, we propose an intermediate approach that uses coordination between correlated processes but retains the scalability advantage of message logging between independent ones. The algorithm still belongs to the family of event logging protocols but eliminates the need for costly payload logging between coordinated processes. Copyright © 2012 John Wiley & Sons, Ltd. |
---|---|
Bibliography: | ArticleID:CPE2859; ark:/67375/WNG-PM397PK9-0; istex:1F7C355B9776DE131D7950A7BB9E6126336AF643 |
ISSN: | 1532-0626 1532-0634 |
DOI: | 10.1002/cpe.2859 |