Understanding the Limits of Passive Realtime Datacenter Fault Detection and Localization
Published in | IEEE/ACM Transactions on Networking, Vol. 27, no. 5, pp. 2001-2014 |
---|---|
Main Authors | Roy, Arjun; Das, Rajdeep; Zeng, Hongyi; Bagga, Jasmeet; Snoeren, Alex C. |
Format | Journal Article |
Language | English |
Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.10.2019 |
ISSN | 1063-6692 (ISSN); 1558-2566 (EISSN) |
DOI | 10.1109/TNET.2019.2938228 |
Abstract | Datacenters are characterized by large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with non-zero failure rates mean that datacenters are subject to significant numbers of failures that can impact performance. Moreover, failures are not always obvious; network components can fail partially, dropping or delaying only subsets of packets. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors. We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate end-host transport-layer flow metrics with per-flow network paths and apply statistical analysis techniques to identify outliers and localize faulty links and/or switches. We evaluate our approach in a production Facebook front-end datacenter, focusing on its effectiveness across a range of traffic patterns. |
---|---|
Authors and affiliations | 1. Arjun Roy (arroy@cs.ucsd.edu, ORCID 0000-0003-2864-9111), Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA; 2. Rajdeep Das (r4das@cs.ucsd.edu, ORCID 0000-0003-0513-4967), Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA; 3. Hongyi Zeng (zeng@fb.com), Facebook Inc., Menlo Park, CA, USA; 4. Jasmeet Bagga (jasmeetbagga@fb.com), Facebook Inc., Menlo Park, CA, USA; 5. Alex C. Snoeren (snoeren@cs.ucsd.edu, ORCID 0000-0001-5679-3888), Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA |
CODEN | IEANEP |
Copyright | Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2019 |
Discipline | Engineering |
Genre | orig-research |
Funding | National Science Foundation (funder ID 10.13039/100000001), grants CNS-1422240 and CNS-1564185 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
Subjects | Circuit faults; Computer network reliability; Data analysis; Data centers; Failure rates; Fault detection; Fault diagnosis; Hardware; Localization; Monitoring; Outliers (statistics); Production; Statistical analysis; Switches |
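
The abstract describes correlating end-host transport-layer flow metrics with per-flow network paths and then applying statistical analysis to find outlier links. The following is only a rough sketch of that general idea, not the authors' method: the record layout `(path, packets_sent, retransmits)`, the link names, the use of retransmit rate as the metric, and the median-absolute-deviation outlier test with a 3.5 cutoff are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def link_retransmit_rates(flows):
    """Aggregate per-link retransmit rates from per-flow records.

    Each flow is (path, packets_sent, retransmits), where path is the list
    of link identifiers the flow traversed (hypothetical record layout).
    """
    sent = defaultdict(int)
    retx = defaultdict(int)
    for path, packets, retransmits in flows:
        for link in path:
            sent[link] += packets
            retx[link] += retransmits
    return {link: retx[link] / sent[link] for link in sent if sent[link] > 0}


def outlier_links(rates, candidates, threshold=3.5):
    """Return candidate links whose rate is unusually high among their peers.

    `candidates` should be links expected to behave alike (for example,
    equal-cost alternatives). Uses a modified z-score based on the median
    absolute deviation (MAD); this test is an illustrative choice.
    """
    present = [link for link in candidates if link in rates]
    if len(present) < 3:
        return []  # too few peers for a meaningful comparison
    values = [rates[link] for link in present]
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread among peers: fall back to flagging anything above the median.
        return sorted(link for link in present if rates[link] > med)
    return sorted(
        link for link in present
        if 0.6745 * (rates[link] - med) / mad > threshold
    )


# Hypothetical flows: two host uplinks, four equal-cost spine links.
example_flows = [
    (["up1", "s1"], 1000, 1), (["up1", "s2"], 1000, 2),
    (["up1", "s3"], 1000, 1), (["up1", "s4"], 1000, 50),
    (["up2", "s1"], 1000, 2), (["up2", "s2"], 1000, 2),
    (["up2", "s3"], 1000, 1), (["up2", "s4"], 1000, 40),
]
example_rates = link_retransmit_rates(example_flows)
suspects = outlier_links(example_rates, ["s1", "s2", "s3", "s4"])
```

In this made-up data only `s4` carries a retransmit rate far above its peers, so it is the only link in `suspects`. Comparing links only against peers that should carry similar traffic is what lets a partial fault stand out without active probing.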