Reliable and Efficient Distributed Checkpointing System for Grid Environments


Bibliographic Details
Published in: Journal of Grid Computing, Vol. 12, no. 4, pp. 593-613
Main Authors: Islam, Tanzima Zerin; Bagchi, Saurabh; Eigenmann, Rudolf
Format: Journal Article
Language: English
Published: Dordrecht, Springer Netherlands, 01.12.2014
Springer
Springer Nature B.V
Summary: In Fine-Grained Cycle Sharing (FGCS) systems, machine owners voluntarily share their unused CPU cycles with guest jobs, as long as their performance degradation is tolerable. However, unpredictable evictions of guest jobs lead to fluctuating completion times. Checkpoint-recovery is an attractive mechanism for recovering from such “failures”. Today’s FGCS systems often use expensive, high-performance dedicated checkpoint servers. However, in geographically distributed clusters, this may incur high checkpoint transfer latencies. In this paper, we present a distributed checkpointing system called Falcon that uses available disk resources of the FGCS machines as shared checkpoint repositories. However, an unavailable storage host may lead to loss of checkpoint data. Therefore, we model the failures of a storage host and develop a prediction algorithm for choosing reliable checkpoint repositories. We experiment with Falcon in the university-wide Condor testbed at Purdue and show improved and consistent performance for guest jobs in the presence of irregular resource availability.
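The idea of choosing reliable checkpoint repositories from a failure model can be illustrated with a minimal sketch. This is not Falcon's actual prediction algorithm, which the summary does not describe. It assumes, purely for illustration, that each storage host's time to unavailability is exponentially distributed with a mean estimated from past availability logs. The host names, mean uptimes, and the functions `survival_probability` and `choose_repositories` are all hypothetical.

```python
import math


def survival_probability(mean_uptime_hours, window_hours):
    """Probability that a host stays available for window_hours.

    Assumption (not from the paper): an exponential failure model,
    P(survive t) = exp(-t / mean_uptime).
    """
    if mean_uptime_hours <= 0:
        return 0.0
    return math.exp(-window_hours / mean_uptime_hours)


def choose_repositories(hosts, window_hours, k):
    """Return the names of the k hosts most likely to remain available.

    hosts maps a hypothetical host name to its estimated mean uptime in hours.
    """
    ranked = sorted(
        hosts.items(),
        key=lambda item: survival_probability(item[1], window_hours),
        reverse=True,
    )
    return [name for name, _ in ranked[:k]]


# Hypothetical candidates: estimated mean uptime in hours per storage host.
hosts = {"hostA": 2.0, "hostB": 48.0, "hostC": 12.0}

# Pick two repositories for a checkpoint that must survive 6 hours.
selected = choose_repositories(hosts, window_hours=6.0, k=2)
```

Under these assumptions, `selected` contains `hostB` and `hostC`, the two hosts with the longest estimated uptimes. A real system would also need to weigh transfer latency and free disk space, which this sketch ignores.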
Report number: LLNL-JRNL-649440
Funding: National Science Foundation (NSF); Purdue Research Foundation; USDOE National Nuclear Security Administration (NNSA)
Grant numbers: AC52-07NA27344; 0751153-CNS; 0707931-CNS; 0833115-CCF
ISSN: 1570-7873, 1572-9184
DOI: 10.1007/s10723-014-9297-4