The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors
Published in | Computing Vol. 105; no. 12; pp. 2821 - 2845 |
---|---|
Main Authors | Duarte, Elias P.; Rodrigues, Luiz A.; Camargo, Edson T.; Turchetti, Rogério C. |
Format | Journal Article |
Language | English |
Published | Vienna: Springer Vienna; Springer Nature B.V., 01.12.2023 |
Abstract | Reliable systems require effective monitoring techniques for fault identification. System-level diagnosis was originally proposed in the 1960s as a test-based approach to monitor and identify faulty components of a general system. Over the last decades, several diagnosis models and strategies have been proposed, based on different fault models, and applied to the most diverse types of computer systems. In the 1990s, unreliable failure detectors emerged as an abstraction to enable consensus in asynchronous systems subject to crash faults. Since then, failure detectors have become the de facto standard for monitoring distributed systems. The purpose of the present work is to fill a conceptual gap by presenting a distributed diagnosis model that is consistent with unreliable failure detectors. Properties are proven for the number of tests/monitoring messages required, latency for event detection, as well as completeness and accuracy. Three different failure detectors compliant with the proposed model are presented, including vRing and vCube, which provide scalable alternatives to the traditional all-monitor-all strategy adopted by most existing failure detectors. |
---|---|
Author | Turchetti, Rogério C.; Duarte, Elias P.; Camargo, Edson T.; Rodrigues, Luiz A. |
Author_xml | 1. Duarte, Elias P. (Federal University of Paraná; ORCID 0000-0002-8916-3302; elias@inf.ufpr.br); 2. Rodrigues, Luiz A. (Western Paraná State University, UNIOESTE; ORCID 0000-0002-9516-1282); 3. Camargo, Edson T. (Technological Federal University of Paraná, UTFPR; ORCID 0000-0002-6520-9142); 4. Turchetti, Rogério C. (Federal University of Santa Maria, UFSM; ORCID 0000-0002-5242-5057) |
ContentType | Journal Article |
Copyright | The Author(s), under exclusive licence to Springer-Verlag GmbH Austria, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law. |
DOI | 10.1007/s00607-023-01211-8 |
Discipline | Mathematics Computer Science |
EISSN | 1436-5057 |
EndPage | 2845 |
ExternalDocumentID | 10_1007_s00607_023_01211_8 |
GrantInformation_xml | Fundação de Amparo à Pesquisa do Estado de São Paulo, grant 2021/06923-0 (funder ID: http://dx.doi.org/10.13039/501100001807); Conselho Nacional de Desenvolvimento Científico e Tecnológico, grant 308959/2020-5 (funder ID: http://dx.doi.org/10.13039/501100003593) |
ISSN | 0010-485X |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 12 |
Keywords | Fault tolerance; System-level diagnosis; Fault management; Distributed systems; Fault monitoring; Failure detection; 68M15; 68M14 |
Language | English |
ORCID | 0000-0002-5242-5057 0000-0002-8916-3302 0000-0002-9516-1282 0000-0002-6520-9142 |
PageCount | 25 |
PublicationCentury | 2000 |
PublicationDate | 20231200 2023-12-00 20231201 |
PublicationDateYYYYMMDD | 2023-12-01 |
PublicationDecade | 2020 |
PublicationPlace | Vienna |
PublicationTitle | Computing |
PublicationTitleAbbrev | Computing |
PublicationYear | 2023 |
Publisher | Springer Vienna; Springer Nature B.V. |
StartPage | 2821 |
SubjectTerms | Artificial Intelligence; Computer Appl. in Administrative Data Processing; Computer Communication Networks; Computer networks; Computer Science; Detectors; Distributed systems management; Failure; Fault detection; Fault diagnosis; Fault tolerance; Information Systems Applications (incl. Internet); Monitoring; Regular Paper; Sensors; Software Engineering; System effectiveness |
Title | The missing piece: a distributed system-level diagnosis model for the implementation of unreliable failure detectors |
URI | https://link.springer.com/article/10.1007/s00607-023-01211-8 https://www.proquest.com/docview/2877031223 |
Volume | 105 |