MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data

Due to the complexity and dynamic nature of large-scale microservice systems, manual troubleshooting is time-consuming and impractical. Therefore, automated Root Cause Analysis (RCA) is essential. However, existing RCA approaches face significant challenges. (1) Multi-modal data (e.g. traces, logs,...

Full description

Saved in:
Bibliographic Details
Published inIEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 1057 - 1068
Main Authors Wang, Yidan, Zhu, Zhouruixing, Fu, Qiuai, Ma, Yuchi, He, Pinjia
Format Conference Proceeding
LanguageEnglish
Published ACM 27.10.2024
Subjects
Online AccessGet full text
ISSN2643-1572
DOI10.1145/3691620.3695485

Cover

More Information
Summary:Due to the complexity and dynamic nature of large-scale microservice systems, manual troubleshooting is time-consuming and impractical. Therefore, automated Root Cause Analysis (RCA) is essential. However, existing RCA approaches face significant challenges. (1) Multi-modal data (e.g. traces, logs, and metrics) record the status of microservice systems, but most existing RCA approaches rely on single-source data, failing to understand the system fully. (2) Existing RCA approaches ignore the services' anomaly state and their anomaly intensity. (3) The service-level RCAs lack detailed information for quick issue resolution. To tackle these challenges, we propose MRCA, a metric-level RCA approach using multi-modal data. Our key insight is that using multi-modal data allows for a comprehensive understanding of the system, enabling the localization of root causes across more anomaly scenarios. MRCA first utilizes traces and logs to obtain the ranking list of abnormal services based on reconstruction probability. It further builds causal graphs from services with high anomaly probability to discover the order in which abnormal metrics of different services occur. By incorporating a reward mechanism, MRCA terminates the excessive expansion of the causal graph and significantly reduces the time taken for causal analysis. Finally, MRCA can prune the ranking list based on the causal graph and identify metric-level root causes. Experiments on two widely-used microservice benchmarks demonstrate that MRCA outperforms state-of-the-art approaches in terms of both accuracy and efficiency.
ISSN:2643-1572
DOI:10.1145/3691620.3695485