SuperBench: Improving Cloud AI Infrastructure Reliability with Proactive Validation
Main Authors | , , , , , , , , , , , , , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 09.02.2024 |
Summary: | Reliability in cloud AI infrastructure is crucial for cloud service
providers, prompting the widespread use of hardware redundancies. However,
these redundancies can inadvertently lead to hidden degradation, so-called
"gray failure", for AI workloads, significantly affecting end-to-end
performance and concealing performance issues, which complicates root cause
analysis for failures and regressions.
We introduce SuperBench, a proactive validation system for AI infrastructure
that mitigates hidden degradation caused by hardware redundancies and enhances
overall reliability. SuperBench features a comprehensive benchmark suite,
capable of evaluating individual hardware components and representing most real
AI workloads. It comprises a Validator, which learns benchmark criteria to
clearly pinpoint defective components. Additionally, SuperBench incorporates a
Selector to balance validation time and issue-related penalties, enabling
optimal timing for validation execution with a tailored subset of benchmarks.
Through testbed evaluation and simulation, we demonstrate that SuperBench can
increase the mean time between incidents by up to 22.61x. SuperBench has been
successfully deployed in Azure production, validating hundreds of thousands of
GPUs over the last two years. |
---|---|
DOI: | 10.48550/arxiv.2402.06194 |