Reliable and Efficient Distributed Checkpointing System for Grid Environments
Published in | Journal of grid computing Vol. 12; no. 4; pp. 593 - 613 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | Dordrecht; Springer Netherlands; 01.12.2014; Springer; Springer Nature B.V |
Summary: | In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such “failures”. Today’s FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper we present a distributed checkpointing system called Falcon that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures of a storage host and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability. |
---|---|
Bibliography: | LLNL-JRNL-649440; National Science Foundation (NSF); Purdue Research Foundation; USDOE National Nuclear Security Administration (NNSA); AC52-07NA27344; 0751153-CNS; 0707931-CNS; 0833115-CCF |
ISSN: | 1570-7873; 1572-9184 |
DOI: | 10.1007/s10723-014-9297-4 |