MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data
Published in | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] pp. 1057 - 1068 |
---|---|
Main Authors | Wang, Yidan; Zhu, Zhouruixing; Fu, Qiuai; Ma, Yuchi; He, Pinjia |
Format | Conference Proceeding |
Language | English |
Published | ACM, 27.10.2024 |
ISSN | 2643-1572 |
DOI | 10.1145/3691620.3695485 |
Abstract | Due to the complexity and dynamic nature of large-scale microservice systems, manual troubleshooting is time-consuming and impractical. Therefore, automated Root Cause Analysis (RCA) is essential. However, existing RCA approaches face significant challenges. (1) Multi-modal data (e.g. traces, logs, and metrics) record the status of microservice systems, but most existing RCA approaches rely on single-source data, failing to understand the system fully. (2) Existing RCA approaches ignore the services' anomaly state and their anomaly intensity. (3) The service-level RCAs lack detailed information for quick issue resolution. To tackle these challenges, we propose MRCA, a metric-level RCA approach using multi-modal data. Our key insight is that using multi-modal data allows for a comprehensive understanding of the system, enabling the localization of root causes across more anomaly scenarios. MRCA first utilizes traces and logs to obtain the ranking list of abnormal services based on reconstruction probability. It further builds causal graphs from services with high anomaly probability to discover the order in which abnormal metrics of different services occur. By incorporating a reward mechanism, MRCA terminates the excessive expansion of the causal graph and significantly reduces the time taken for causal analysis. Finally, MRCA can prune the ranking list based on the causal graph and identify metric-level root causes. Experiments on two widely-used microservice benchmarks demonstrate that MRCA outperforms state-of-the-art approaches in terms of both accuracy and efficiency. |
---|---|
Author | Zhu, Zhouruixing; Wang, Yidan; Ma, Yuchi; Fu, Qiuai; He, Pinjia |
Author_xml | – sequence: 1 givenname: Yidan surname: Wang fullname: Wang, Yidan email: phoebeyidanwang@gmail.com organization: The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen),Shenzhen,China – sequence: 2 givenname: Zhouruixing surname: Zhu fullname: Zhu, Zhouruixing email: zhouruixingzhu@link.cuhk.edu.cn organization: The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen),Shenzhen,China – sequence: 3 givenname: Qiuai surname: Fu fullname: Fu, Qiuai email: fuqiuai@huawei.com organization: Huawei Cloud Computing Technologies CO., LTD, China,Shenzhen,China – sequence: 4 givenname: Yuchi surname: Ma fullname: Ma, Yuchi email: mayuchi1@huawei.com organization: Huawei Cloud Computing Technologies CO., LTD, China,Shenzhen,China – sequence: 5 givenname: Pinjia surname: He fullname: He, Pinjia email: hepinjia@cuhk.edu.cn organization: The Chinese University of Hong Kong, Shenzhen (CUHK-Shenzhen),Shenzhen Research Institute of Big Data,Shenzhen,China |
CODEN | IEEPAD |
ContentType | Conference Proceeding |
DBID | 6IE 6IL CBEJK RIE RIL |
DOI | 10.1145/3691620.3695485 |
DatabaseName | IEEE Electronic Library (IEL) Conference Proceedings IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume IEEE Xplore All Conference Proceedings IEEE Electronic Library (IEL) IEEE Proceedings Order Plans (POP All) 1998-Present |
DatabaseTitleList | |
Database_xml | – sequence: 1 dbid: RIE name: IEL url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/ sourceTypes: Publisher |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Computer Science |
EISBN | 9798400712487 |
EISSN | 2643-1572 |
EndPage | 1068 |
ExternalDocumentID | 10764888 |
Genre | orig-research |
GrantInformation_xml | – fundername: National Natural Science Foundation of China funderid: 10.13039/501100001809 – fundername: Shenzhen Research Institute of Big Data funderid: 10.13039/501100020785 |
GroupedDBID | 6IE 6IF 6IH 6IK 6IL 6IM 6IN 6J9 AAJGR AAWTH ABLEC ACREN ADYOE ADZIZ AFYQB ALMA_UNASSIGNED_HOLDINGS AMTXH BEFXN BFFAM BGNUA BKEBE BPEOZ CBEJK CHZPO IEGSK IPLJI M43 OCL RIE RIL |
ID | FETCH-LOGICAL-a248t-de318c90c3a278a40faa4374a749e1887f9c0bdfb3ee4871606af8b3f615c8543 |
IEDL.DBID | RIE |
IngestDate | Wed Jan 15 06:20:39 EST 2025 |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
LinkModel | DirectLink |
MergedId | FETCHMERGED-LOGICAL-a248t-de318c90c3a278a40faa4374a749e1887f9c0bdfb3ee4871606af8b3f615c8543 |
PageCount | 12 |
ParticipantIDs | ieee_primary_10764888 |
PublicationCentury | 2000 |
PublicationDate | 2024-Oct.-27 |
PublicationDateYYYYMMDD | 2024-10-27 |
PublicationDate_xml | – month: 10 year: 2024 text: 2024-Oct.-27 day: 27 |
PublicationDecade | 2020 |
PublicationTitle | IEEE/ACM International Conference on Automated Software Engineering : [proceedings] |
PublicationTitleAbbrev | ASE |
PublicationYear | 2024 |
Publisher | ACM |
Publisher_xml | – name: ACM |
SSID | ssib057256116 ssj0051577 |
Score | 2.285896 |
Snippet | Due to the complexity and dynamic nature of large-scale microservice systems, manual troubleshooting is time-consuming and impractical. Therefore, automated... |
SourceID | ieee |
SourceType | Publisher |
StartPage | 1057 |
SubjectTerms | Accuracy; Complexity theory; Location awareness; Measurement; Microservice architectures; Microservices; Multi-modal; Reinforcement Learning; Root cause analysis; Software engineering; Software reliability; Stability analysis; Thermal stability |
Title | MRCA: Metric-level Root Cause Analysis for Microservices via Multi-Modal Data |
URI | https://ieeexplore.ieee.org/document/10764888 |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
linkProvider | IEEE |
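
The abstract describes a pipeline that ranks abnormal services and then prunes that ranking with a causal graph. The following is a minimal illustrative sketch of that general idea only, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the precomputed per-service anomaly scores (standing in for the reconstruction-probability model), the fixed threshold, and the pruning rule of keeping only candidates that no other candidate causes. The reward mechanism mentioned in the abstract is not modeled.

```python
from typing import Dict, List, Tuple


def rank_services(anomaly_score: Dict[str, float]) -> List[Tuple[str, float]]:
    """Order services from most to least anomalous.

    The scores are hypothetical inputs; in the paper they would be derived
    from traces and logs.
    """
    return sorted(anomaly_score.items(), key=lambda item: item[1], reverse=True)


def select_candidates(ranking: List[Tuple[str, float]], threshold: float) -> List[str]:
    """Keep services whose score reaches the (assumed) threshold, in rank order."""
    return [name for name, score in ranking if score >= threshold]


def prune_by_causal_graph(
    candidates: List[str], edges: List[Tuple[str, str]]
) -> List[str]:
    """Drop candidates that are the effect of another candidate.

    Each edge is (cause, effect). This pruning rule is an assumption chosen
    for illustration; the paper may prune differently.
    """
    candidate_set = set(candidates)
    has_cause = {
        effect
        for cause, effect in edges
        if cause in candidate_set and effect in candidate_set and cause != effect
    }
    return [name for name in candidates if name not in has_cause]


if __name__ == "__main__":
    # Hypothetical scores and a single hypothetical causal edge.
    scores = {"cart": 0.91, "payment": 0.35, "db": 0.78}
    ranking = rank_services(scores)
    candidates = select_candidates(ranking, threshold=0.5)
    print(prune_by_causal_graph(candidates, [("db", "cart")]))
```

In this toy setup the threshold limits how many services enter the causal step, which mirrors the abstract's point that causal analysis is only run on services with high anomaly probability.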