Understanding the Limits of Passive Realtime Datacenter Fault Detection and Localization

Bibliographic Details
Published in: IEEE/ACM Transactions on Networking, Vol. 27, no. 5, pp. 2001-2014
Main Authors: Roy, Arjun; Das, Rajdeep; Zeng, Hongyi; Bagga, Jasmeet; Snoeren, Alex C.
Format: Journal Article
Language: English
Published: New York: IEEE, 01.10.2019; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
ISSN: 1063-6692, 1558-2566
DOI: 10.1109/TNET.2019.2938228

Summary: Datacenters are characterized by large scale, stringent reliability requirements, and significant application diversity. However, the realities of employing hardware with non-zero failure rates mean that datacenters are subject to significant numbers of failures that can impact performance. Moreover, failures are not always obvious; network components can fail partially, dropping or delaying only subsets of packets. Thus, traditional fault detection techniques involving end-host or router-based statistics can fall short in their ability to identify these errors. We describe how to expedite the process of detecting and localizing partial datacenter faults using an end-host method generalizable to most datacenter applications. In particular, we correlate end-host transport-layer flow metrics with per-flow network paths and apply statistical analysis techniques to identify outliers and localize faulty links and/or switches. We evaluate our approach in a production Facebook front-end datacenter, focusing on its effectiveness across a range of traffic patterns.
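
The general idea in the summary (join per-flow transport metrics with the path each flow took, then look for statistical outliers among links) can be illustrated with a small sketch. This is not the authors' algorithm: the record does not say which metrics or tests the article uses. The function name `localize_outlier_links`, the tuple layout of `flows`, the choice of retransmit rate as the metric, and the median/MAD outlier test with a 3.5 cutoff are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import median


def localize_outlier_links(flows, threshold=3.5):
    """Flag links whose aggregate retransmit rate is a robust outlier.

    flows: iterable of (path, packets_sent, retransmits), where path is a
    list of link identifiers the flow traversed (hypothetical layout).
    Returns a sorted list of suspicious link identifiers.
    """
    sent = defaultdict(int)
    retx = defaultdict(int)
    # Attribute every flow's counters to each link on its path.
    for path, packets, retransmits in flows:
        for link in path:
            sent[link] += packets
            retx[link] += retransmits

    rates = {link: retx[link] / sent[link] for link in sent if sent[link] > 0}
    if not rates:
        return []

    # Modified z-score based on the median absolute deviation (MAD),
    # an assumed stand-in for whatever statistical test is really used.
    med = median(rates.values())
    mad = median(abs(r - med) for r in rates.values())
    if mad == 0:
        # More than half the links share one rate; flag any link above it.
        return sorted(l for l, r in rates.items() if r > med)
    return sorted(
        l for l, r in rates.items() if 0.6745 * (r - med) / mad > threshold
    )


if __name__ == "__main__":
    # Synthetic data: 4 uplinks x 4 downlinks, with downlink "d2" lossy.
    flows = [
        ([f"u{i}", f"d{j}"], 1000, 40 if j == 2 else 1)
        for i in range(1, 5)
        for j in range(1, 5)
    ]
    print(localize_outlier_links(flows))
```

The sketch depends on path diversity: because flows through the lossy link are spread over several uplinks, the extra retransmits are diluted on the healthy links and concentrated on the faulty one. With too few distinct paths, links that share every flow with the faulty link would look equally bad.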