MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data

Bibliographic Details
Published in IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 1057-1068
Main Authors Wang, Yidan; Zhu, Zhouruixing; Fu, Qiuai; Ma, Yuchi; He, Pinjia
Format Conference Proceeding
Language English
Published ACM 27.10.2024
ISSN 2643-1572
DOI 10.1145/3691620.3695485

Abstract Due to the complexity and dynamic nature of large-scale microservice systems, manual troubleshooting is time-consuming and impractical. Therefore, automated Root Cause Analysis (RCA) is essential. However, existing RCA approaches face significant challenges. (1) Multi-modal data (e.g., traces, logs, and metrics) record the status of microservice systems, but most existing RCA approaches rely on single-source data and therefore fail to understand the system fully. (2) Existing RCA approaches ignore services' anomaly states and anomaly intensity. (3) Service-level RCA approaches lack the detailed information needed for quick issue resolution. To tackle these challenges, we propose MRCA, a metric-level RCA approach using multi-modal data. Our key insight is that using multi-modal data allows for a comprehensive understanding of the system, enabling the localization of root causes across more anomaly scenarios. MRCA first utilizes traces and logs to obtain a ranking list of abnormal services based on reconstruction probability. It then builds causal graphs from services with high anomaly probability to discover the order in which abnormal metrics of different services occur. By incorporating a reward mechanism, MRCA terminates the excessive expansion of the causal graph and significantly reduces the time taken for causal analysis. Finally, MRCA prunes the ranking list based on the causal graph and identifies metric-level root causes. Experiments on two widely used microservice benchmarks demonstrate that MRCA outperforms state-of-the-art approaches in terms of both accuracy and efficiency.
Author_xml – sequence: 1
  givenname: Yidan
  surname: Wang
  fullname: Wang, Yidan
  email: phoebeyidanwang@gmail.com
  organization: The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China
– sequence: 2
  givenname: Zhouruixing
  surname: Zhu
  fullname: Zhu, Zhouruixing
  email: zhouruixingzhu@link.cuhk.edu.cn
  organization: The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen, China
– sequence: 3
  givenname: Qiuai
  surname: Fu
  fullname: Fu, Qiuai
  email: fuqiuai@huawei.com
  organization: Huawei Cloud Computing Technologies CO., LTD, China, Shenzhen, China
– sequence: 4
  givenname: Yuchi
  surname: Ma
  fullname: Ma, Yuchi
  email: mayuchi1@huawei.com
  organization: Huawei Cloud Computing Technologies CO., LTD, China, Shenzhen, China
– sequence: 5
  givenname: Pinjia
  surname: He
  fullname: He, Pinjia
  email: hepinjia@cuhk.edu.cn
  organization: The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen), Shenzhen Research Institute of Big Data, Shenzhen, China
CODEN IEEPAD
ContentType Conference Proceeding
Discipline Computer Science
EISBN 9798400712487
EISSN 2643-1572
EndPage 1068
ExternalDocumentID 10764888
Genre orig-research
GrantInformation_xml – fundername: National Natural Science Foundation of China
  funderid: 10.13039/501100001809
– fundername: Shenzhen Research Institute of Big Data
  funderid: 10.13039/501100020785
IngestDate Wed Jan 15 06:20:39 EST 2025
IsPeerReviewed false
IsScholarly true
Language English
PageCount 12
ParticipantIDs ieee_primary_10764888
PublicationCentury 2000
PublicationDate 2024-Oct.-27
PublicationDateYYYYMMDD 2024-10-27
PublicationDate_xml – month: 10
  year: 2024
  text: 2024-Oct.-27
  day: 27
PublicationDecade 2020
PublicationTitle IEEE/ACM International Conference on Automated Software Engineering : [proceedings]
PublicationTitleAbbrev ASE
PublicationYear 2024
Publisher ACM
Publisher_xml – name: ACM
StartPage 1057
SubjectTerms Accuracy
Complexity theory
Location awareness
Measurement
Microservice architectures
Microservices
Multi-modal
Reinforcement Learning
Root cause analysis
Software engineering
Software reliability
Stability analysis
Thermal stability
Title MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data
URI https://ieeexplore.ieee.org/document/10764888