Log‐based anomaly detection for distributed systems: State of the art, industry experience, and open issues

Bibliographic Details
Published in Journal of software : evolution and process, Vol. 36, no. 8
Main Authors Wei, Xinjie; Wang, Jie; Sun, Chang‐ai; Towey, Dave; Zhang, Shoufeng; Zuo, Wanqing; Yu, Yiming; Ruan, Ruoyi; Song, Guyang
Format Journal Article
Language English
Published Chichester: Wiley Subscription Services, Inc, 01.08.2024

Summary: Distributed systems are widely used in many safety‐critical areas. Any abnormality (e.g., service interruption or service‐quality degradation) could crash applications or reduce user satisfaction, which may cause serious economic losses. Among the various quality‐assurance approaches for distributed systems, log‐based anomaly detection (LAD) has become a popular research topic because system logs record and reveal important run‐time information. This paper presents a general LAD framework for distributed systems. Log grouping and feature‐pattern mining are two crucial LAD components that affect anomaly‐detection effectiveness. We present a systematic survey of techniques in these two directions; propose classification frameworks for log grouping and feature patterns; and summarize four log‐grouping techniques and five feature patterns (invariant relationships among logs that can be used for anomaly detection). To evaluate their applicability, we report the findings from applying existing techniques to Ray, a popular industrial distributed system. Based on these findings, we identify several open issues that provide potential guidance for future research and development.

A log‐based anomaly‐detection framework that identifies and highlights the log‐grouping and feature‐pattern mining steps.
A systematic review of state‐of‐the‐art log‐grouping and feature‐pattern mining techniques.
Industry experience with Ray, Hadoop, and BlueGene/L that reveals gaps between theory and industrial settings and points to several open issues based on these gaps.
ISSN: 2047-7473, 2047-7481
DOI: 10.1002/smr.2650