Understanding the Limits of Passive Realtime Datacenter Fault Detection and Localization

Bibliographic Details
Published in IEEE/ACM Transactions on Networking, Vol. 27, no. 5, pp. 2001-2014
Main Authors Roy, Arjun; Das, Rajdeep; Zeng, Hongyi; Bagga, Jasmeet; Snoeren, Alex C.
Format Journal Article
Language English
Published New York, IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.10.2019
ISSN 1063-6692
EISSN 1558-2566
DOI 10.1109/TNET.2019.2938228

Abstract Datacenters are characterized by large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with non-zero failure rates mean that datacenters are subject to significant numbers of failures that can impact performance. Moreover, failures are not always obvious; network components can fail partially, dropping or delaying only subsets of packets. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors. We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate end-host transport-layer flow metrics with per-flow network paths and apply statistical analysis techniques to identify outliers and localize faulty links and/or switches. We evaluate our approach in a production Facebook front-end datacenter, focusing on its effectiveness across a range of traffic patterns.
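
The correlation step named in the abstract (joining per-flow transport metrics with per-flow paths, then looking for statistical outliers among links) can be illustrated with a small sketch. This is not the authors' implementation: the record does not say which statistical test the paper uses, so a median-absolute-deviation (MAD) robust z-score is swapped in here purely for illustration. The function names, the link labels (`e1`, `c3`, and so on), the sample counters, and the 3.5 threshold are all hypothetical.

```python
from collections import defaultdict
from statistics import median

def link_retransmit_rates(flows):
    """Charge each flow's packet and retransmit counts to every link on its path.

    `flows` is an iterable of (path, packets_sent, retransmits) tuples, where
    `path` is a sequence of link names. This input layout is an assumption.
    """
    sent = defaultdict(int)
    retx = defaultdict(int)
    for path, packets, retransmits in flows:
        for link in path:
            sent[link] += packets
            retx[link] += retransmits
    return {link: retx[link] / sent[link] for link in sent if sent[link] > 0}

def outlier_links(rates, threshold=3.5):
    """Return links whose retransmit rate sits far above the median rate.

    Uses a MAD-based robust z-score (an illustrative choice, not taken from
    the paper). When the MAD is zero, any link above the median is flagged.
    """
    if not rates:
        return []
    values = list(rates.values())
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return sorted(link for link, v in rates.items() if v > med)
    return sorted(
        link for link, v in rates.items()
        if 0.6745 * (v - med) / mad > threshold
    )

# Hypothetical data: two edge links, four core links, traffic spread evenly.
# Flows that cross core link "c3" see many more retransmits than the rest.
flows = [
    (("e1", "c1"), 1000, 1), (("e1", "c2"), 1000, 1),
    (("e1", "c3"), 1000, 50), (("e1", "c4"), 1000, 1),
    (("e2", "c1"), 1000, 1), (("e2", "c2"), 1000, 1),
    (("e2", "c3"), 1000, 50), (("e2", "c4"), 1000, 1),
]
suspects = outlier_links(link_retransmit_rates(flows))
print(suspects)
```

In this toy input the edge links also carry the lossy flows, but because their traffic is spread across all four core links their aggregate rate is diluted, so only the link shared by every lossy flow stands out. Real traffic is rarely this evenly balanced, which is one reason a sketch like this should not be read as a statement of the paper's results.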
Author Roy, Arjun (arroy@cs.ucsd.edu; ORCID 0000-0003-2864-9111), Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA
Author Das, Rajdeep (r4das@cs.ucsd.edu; ORCID 0000-0003-0513-4967), Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA
Author Zeng, Hongyi (zeng@fb.com), Facebook Inc., Menlo Park, CA, USA
Author Bagga, Jasmeet (jasmeetbagga@fb.com), Facebook Inc., Menlo Park, CA, USA
Author Snoeren, Alex C. (snoeren@cs.ucsd.edu; ORCID 0000-0001-5679-3888), Department of Computer Science and Engineering, University of California at San Diego, La Jolla, CA, USA
CODEN IEANEP
Copyright Copyright The Institute of Electrical and Electronics Engineers, Inc. (IEEE) 2019
Discipline Engineering
Funding National Science Foundation (funder ID 10.13039/100000001), grants CNS-1422240 and CNS-1564185
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
License https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html
https://doi.org/10.15223/policy-029
https://doi.org/10.15223/policy-037
Subjects Circuit faults
Computer network reliability
Data analysis
Data centers
Facebook
Failure rates
Fault detection
Fault diagnosis
Hardware
Localization
Monitoring
Outliers (statistics)
Production
Statistical analysis
Switches
URI https://ieeexplore.ieee.org/document/8836620
https://www.proquest.com/docview/2307214372