A fault detection service for wide area distributed computations

Bibliographic Details
Published in High Performance Distributed Computing: Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing; 28-31 July 1998, pp. 268-278
Main Authors Stelling, P., Foster, I., Kesselman, C., Lee, C., Von Laszewski, G.
Format Conference Proceeding
Language English
Published IEEE 1998
Abstract The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, the implementation of these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver.
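The trade-off named in the abstract, timeliness of reporting against false positive rate, can be illustrated with a minimal heartbeat sketch. This is not the paper's architecture or API: the class `HeartbeatMonitor`, its `timeout` parameter, and the component names `node-a` and `node-b` are hypothetical, and a simple fixed timeout is assumed as the detection rule.

```python
from dataclasses import dataclass, field


@dataclass
class HeartbeatMonitor:
    """Timeout-based unreliable failure detector (illustrative sketch only)."""

    timeout: float  # seconds of silence tolerated before a component is suspected
    last_seen: dict = field(default_factory=dict)  # component name -> last heartbeat time

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of a heartbeat from a component.
        self.last_seen[component] = now

    def suspected(self, now: float) -> list:
        # A component is suspected once it has been silent longer than the timeout.
        # The suspicion may be wrong (a slow network also delays heartbeats),
        # which is why such a detector is called unreliable.
        return sorted(c for c, t in self.last_seen.items() if now - t > self.timeout)


# Two monitors observe the same (hypothetical) heartbeat history.
eager = HeartbeatMonitor(timeout=2.0)
patient = HeartbeatMonitor(timeout=10.0)
monitors = (eager, patient)

for m in monitors:
    m.heartbeat("node-a", now=0.0)  # node-a crashes right after this heartbeat
    m.heartbeat("node-b", now=0.0)  # node-b stays alive, but its next heartbeat is delayed

eager_at_4 = eager.suspected(now=4.0)      # reports the crash early, plus a false positive
patient_at_4 = patient.suspected(now=4.0)  # reports nothing yet

for m in monitors:
    m.heartbeat("node-b", now=4.5)  # the delayed heartbeat finally arrives

eager_at_5 = eager.suspected(now=5.0)        # the false suspicion of node-b is withdrawn
patient_at_11 = patient.suspected(now=11.0)  # the crash is reported late, with no false positive
```

In this sketch the short timeout reports the crash of `node-a` sooner but briefly suspects the healthy `node-b`, while the long timeout avoids the false positive at the cost of a later report.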
Author affiliation Stelling, P.: Aerosp. Corp., El Segundo, CA, USA
DOI 10.1109/HPDC.1998.709981
Discipline Mathematics
Computer Science
ISBN 0818685794
9780818685798
ISSN 1082-8907
OpenAccessLink https://digital.library.unt.edu/ark:/67531/metadc622574/m2/1/high_res_d/10848.pdf
Subjects Application software; Computer networks; Computer science; Costs; Distributed computing; Fault detection; Grid computing; Laboratories; Mathematics; Resource management
URI https://ieeexplore.ieee.org/document/709981
https://search.proquest.com/docview/31175624