A fault detection service for wide area distributed computations
Published in | High Performance Distributed Computing: Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing; 28-31 July 1998 pp. 268 - 278 |
---|---|
Main Authors | Stelling, P.; Foster, I.; Kesselman, C.; Lee, C.; Von Laszewski, G. |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 1998 |
Subjects | |
Online Access | Get full text |
Abstract | The potential for faults in distributed computing systems is a significant complicating factor for application developers. While a variety of techniques exist for detecting and correcting faults, implementing these techniques in a particular context can be difficult. Hence, we propose a fault detection service designed to be incorporated, in a modular fashion, into distributed computing systems, tools, or applications. This service uses well-known techniques based on unreliable fault detectors to detect and report component failure, while allowing the user to trade off timeliness of reporting against false positive rates. We describe the architecture of this service, report on experimental results that quantify its cost and accuracy, and describe its use in two applications: monitoring the status of system components of the GUSTO computational grid testbed, and as part of the NetSolve network-enabled numerical solver. |
---|---|
Author | Kesselman, C. Stelling, P. Lee, C. Von Laszewski, G. Foster, I. |
Author_xml | – sequence: 1 givenname: P. surname: Stelling fullname: Stelling, P. organization: Aerosp. Corp., El Segundo, CA, USA – sequence: 2 givenname: I. surname: Foster fullname: Foster, I. – sequence: 3 givenname: C. surname: Kesselman fullname: Kesselman, C. – sequence: 4 givenname: C. surname: Lee fullname: Lee, C. – sequence: 5 givenname: G. surname: Von Laszewski fullname: Von Laszewski, G. |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL 7SC 8FD JQ2 L7M L~C L~D |
DOI | 10.1109/HPDC.1998.709981 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Xplore Digital Library IEEE Proceedings Order Plans (POP All) 1998-Present Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
DatabaseTitle | Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Computer and Information Systems Abstracts |
Database_xml | – sequence: 1 dbid: RIE name: IEEE Electronic Library Online url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Mathematics Computer Science |
EndPage | 278 |
ExternalDocumentID | 709981 |
Genre | Conference Paper |
GroupedDBID | 29P 6IE 6IF 6IK 6IL 6IN AAJGR ACGFS ADZIZ ALMA_UNASSIGNED_HOLDINGS BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IPLJI JC5 M43 OCL RIE RIL RNS 7SC 8FD ACGHX JQ2 L7M L~C L~D RIB RIC |
ID | FETCH-LOGICAL-i247t-a1eeb62d1b06482662a4586cfea87c422eedef23a03172c3e00c7ef78f2c2ce33 |
IEDL.DBID | RIE |
ISBN | 0818685794 9780818685798 |
ISSN | 1082-8907 |
IngestDate | Fri Apr 12 06:55:21 EDT 2024 Wed Jun 26 19:26:46 EDT 2024 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-i247t-a1eeb62d1b06482662a4586cfea87c422eedef23a03172c3e00c7ef78f2c2ce33 |
Notes | SourceType-Conference Papers & Proceedings-1 ObjectType-Conference Paper-1 content type line 25 |
OpenAccessLink | https://digital.library.unt.edu/ark:/67531/metadc622574/m2/1/high_res_d/10848.pdf |
PQID | 31175624 |
PQPubID | 23500 |
PageCount | 11 |
ParticipantIDs | ieee_primary_709981 proquest_miscellaneous_31175624 |
PublicationCentury | 1900 |
PublicationDate | 19980000 19980728 |
PublicationDateYYYYMMDD | 1998-01-01 1998-07-28 |
PublicationDate_xml | – year: 1998 text: 19980000 |
PublicationDecade | 1990 |
PublicationTitle | High Performance Distributed Computing: Proceedings of the 7th IEEE International Symposium on High Performance Distributed Computing; 28-31 July 1998 |
PublicationTitleAbbrev | HPDC |
PublicationYear | 1998 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
SSID | ssj0020127 ssj0001969105 |
Score | 1.7340665 |
SourceID | proquest ieee |
SourceType | Aggregation Database Publisher |
StartPage | 268 |
SubjectTerms | Application software Computer networks Computer science Costs Distributed computing Fault detection Grid computing Laboratories Mathematics Resource management |
Title | A fault detection service for wide area distributed computations |
URI | https://ieeexplore.ieee.org/document/709981 https://search.proquest.com/docview/31175624 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
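
The abstract mentions unreliable fault detectors and a user-controlled trade-off between timeliness of reporting and false positive rate. The following is a minimal sketch of that general idea using heartbeat timeouts; it is not the paper's implementation, and every name in it (`HeartbeatMonitor`, `false_suspicions`, the `worker` label, the sample arrival times) is a hypothetical chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class HeartbeatMonitor:
    """Hypothetical unreliable detector: suspects any component that has been
    silent for longer than `timeout` seconds. A suspicion may be wrong."""
    timeout: float
    last_seen: Dict[str, float] = field(default_factory=dict)

    def heartbeat(self, component: str, now: float) -> None:
        # Record the arrival time of the latest heartbeat from a component.
        self.last_seen[component] = now

    def suspects(self, now: float) -> List[str]:
        # Components whose most recent heartbeat is older than the timeout.
        return sorted(c for c, t in self.last_seen.items()
                      if now - t > self.timeout)


def false_suspicions(timeout: float,
                     arrivals: Sequence[float],
                     check_times: Sequence[float],
                     name: str = "worker") -> int:
    """Count the checks at which a component that never failed is suspected."""
    monitor = HeartbeatMonitor(timeout)
    pending = sorted(arrivals)
    delivered = 0
    count = 0
    for now in sorted(check_times):
        # Deliver every heartbeat that has arrived by this check time.
        while delivered < len(pending) and pending[delivered] <= now:
            monitor.heartbeat(name, pending[delivered])
            delivered += 1
        if name in monitor.suspects(now):
            count += 1
    return count


# Assumed scenario: the component stays alive, but one heartbeat is delayed,
# leaving a 3.5 s gap between arrivals at t=2.0 and t=5.5.
ARRIVALS = [0.0, 1.0, 2.0, 5.5, 6.0, 7.0]
CHECKS = [i * 0.5 for i in range(15)]  # poll every 0.5 s from 0.0 to 7.0

if __name__ == "__main__":
    for timeout in (2.0, 4.0):
        print(timeout, false_suspicions(timeout, ARRIVALS, CHECKS))
```

In this assumed scenario the shorter timeout would report a real failure sooner but wrongly suspects the live component during the delay, while the longer timeout tolerates the delay at the cost of slower reporting.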