A fault detection service for wide area distributed computations

Bibliographic Details
Published in: High Performance Distributed Computing: Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing; 28-31 July 1998, pp. 268-278
Main Authors: Stelling, P., Foster, I., Kesselman, C., Lee, C., Von Laszewski, G.
Format: Conference Proceeding
Language: English
Published: IEEE 1998
Summary: The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false-positive rates. We describe the architecture of this service, report experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver.
ISBN: 0818685794, 9780818685798
ISSN: 1082-8907
DOI: 10.1109/HPDC.1998.709981
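
Illustration: the summary describes detecting component failure with unreliable fault detectors while trading off timeliness of reporting against false-positive rates. This record does not describe the paper's actual architecture, so the following is only a minimal sketch of a generic timeout-based heartbeat detector, not the authors' implementation. The class `HeartbeatDetector`, its `timeout` parameter, and the component names `node-a` and `node-b` are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class HeartbeatDetector:
    """Hypothetical timeout-based (unreliable) fault detector."""
    timeout: float  # seconds of silence before a component is suspected
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        """Record that `component` reported in at time `now`."""
        self.last_seen[component] = now

    def suspected(self, now: float) -> List[str]:
        """Return components silent for longer than `timeout`, sorted by name."""
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout
        )


def demo() -> Dict[str, List[str]]:
    """Feed the same heartbeat timeline to a short-timeout and a long-timeout detector."""
    fast = HeartbeatDetector(timeout=2.0)
    slow = HeartbeatDetector(timeout=10.0)
    report: Dict[str, List[str]] = {}

    # t=0: both components report in. node-b then crashes and never reports again.
    for det in (fast, slow):
        det.heartbeat("node-a", 0.0)
        det.heartbeat("node-b", 0.0)

    # t=3: node-a is alive, but its next heartbeat is delayed in transit.
    report["fast_at_3"] = fast.suspected(3.0)
    report["slow_at_3"] = slow.suspected(3.0)

    # t=4: node-a's delayed heartbeat arrives.
    for det in (fast, slow):
        det.heartbeat("node-a", 4.0)
    report["fast_at_5"] = fast.suspected(5.0)

    # t=10: node-a reports again; at t=11 the long timeout finally expires for node-b.
    for det in (fast, slow):
        det.heartbeat("node-a", 10.0)
    report["slow_at_11"] = slow.suspected(11.0)
    return report


if __name__ == "__main__":
    for key, names in demo().items():
        print(key, names)
```

In this sketch the short timeout flags the crashed `node-b` by t=3 but also wrongly suspects the merely delayed `node-a`, while the long timeout raises no false alarm but does not report `node-b` until t=11. That is the kind of timeliness versus false-positive trade-off the summary refers to.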