Root Cause Analysis for Cloud-Native Applications

Bibliographic Details
Published in IEEE Transactions on Cloud Computing, Vol. 12, no. 1, pp. 232-250
Main Authors Zurkowski, Bartosz; Zielinski, Krzysztof
Format Journal Article
Language English
Published Piscataway IEEE 01.01.2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: Root cause analysis (RCA) is a critical component in maintaining the reliability and performance of modern cloud applications. However, due to the inherent complexity of cloud environments, traditional RCA techniques become insufficient in supporting system administrators in daily incident response routines. This article presents an RCA solution specifically designed for cloud applications, capable of pinpointing failure root causes and recreating complete fault trajectories from the root cause to the effect. The novelty of our approach lies in approximating causal symptom dependencies by synergizing several symptom correlation methods that assess symptoms in terms of structural, semantic, and temporal aspects. The solution integrates statistical methods with system structure and behavior mining, offering a more comprehensive analysis than existing techniques. Based on these concepts, in this work, we provide definitions and construction algorithms for RCA model structures used in the inference, propose a symptom correlation framework encompassing essential elements of symptom data analysis, and provide a detailed description of the elaborated root cause identification process. Functional evaluation on a live microservice-based system demonstrates the effectiveness of our approach in identifying root causes of complex failures across multiple cloud layers.
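Illustration (not taken from the article): the summary describes combining structural, semantic, and temporal symptom correlation to approximate causal dependencies and to rebuild a fault trajectory. The sketch below is one minimal way such a combination could look. It is not the authors' algorithm; the `Symptom` type, the `DEPENDS_ON` call graph, the scoring functions, the weights, and the threshold are all hypothetical choices made for this example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Symptom:
    name: str
    service: str
    time: float      # seconds since the start of the incident window
    message: str

# Hypothetical call graph: each service maps to the services it depends on.
DEPENDS_ON = {"frontend": {"orders"}, "orders": {"db"}, "db": set()}

def structural(cause, effect):
    """1.0 if the effect's service is, or directly depends on, the cause's service."""
    if cause.service == effect.service:
        return 1.0
    return 1.0 if cause.service in DEPENDS_ON.get(effect.service, set()) else 0.0

def temporal(cause, effect, window=60.0):
    """Linear decay with the time gap; 0.0 if the cause does not precede the effect."""
    gap = effect.time - cause.time
    return 1.0 - gap / window if 0.0 < gap <= window else 0.0

def semantic(cause, effect):
    """Jaccard overlap of lower-cased message tokens."""
    a = set(cause.message.lower().split())
    b = set(effect.message.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0

def correlation(cause, effect, weights=(0.5, 0.3, 0.2)):
    """Weighted sum of the three scores (assumed weights)."""
    parts = (structural(cause, effect), temporal(cause, effect), semantic(cause, effect))
    # A pair that fails the structural or temporal test is not treated as causal.
    if parts[0] == 0.0 or parts[1] == 0.0:
        return 0.0
    return sum(w * p for w, p in zip(weights, parts))

def root_causes(symptoms, threshold=0.5):
    """Symptoms that have no sufficiently correlated predecessor."""
    return [e for e in symptoms
            if not any(correlation(c, e) >= threshold for c in symptoms if c is not e)]

def trajectory(root, symptoms, threshold=0.5):
    """Follow the strongest outgoing correlation from the root until none remains."""
    path, current = [root], root
    while True:
        scored = [(correlation(current, e), e) for e in symptoms if e not in path]
        scored = [(s, e) for s, e in scored if s >= threshold]
        if not scored:
            return [s.name for s in path]
        current = max(scored, key=lambda pair: pair[0])[1]
        path.append(current)

symptoms = [
    Symptom("db_disk_full", "db", 0.0, "disk full write failed"),
    Symptom("orders_timeout", "orders", 5.0, "db write timeout"),
    Symptom("frontend_5xx", "frontend", 9.0, "orders request timeout"),
]
```

With this toy data, `root_causes(symptoms)` returns only the `db_disk_full` symptom, and `trajectory(symptoms[0], symptoms)` walks from the database fault through the orders timeout to the frontend error. Requiring both a structural link and correct time ordering before any score is assigned is a simplification chosen here so that unrelated but similarly worded symptoms are never linked.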
ISSN: 2168-7161, 2372-0018
DOI: 10.1109/TCC.2024.3358823