Drill: Log-based Anomaly Detection for Large-scale Storage Systems Using Source Code Analysis

Bibliographic Details
Published in Proceedings - IEEE International Parallel and Distributed Processing Symposium, pp. 189-199
Main Authors Zhang, Di; Egersdoerfer, Chris; Mahmud, Tabassum; Zheng, Mai; Dai, Dong
Format Conference Proceeding
Language English
Published IEEE 01.05.2023
ISSN 1530-2075
DOI 10.1109/IPDPS54959.2023.00028

Summary: Large-scale storage systems, a critical part of modern computing systems, are subject to various runtime bugs, failures, and anomalies in production. Identifying these anomalies at runtime is thus critical for users and administrators. Since runtime logs record important system status, log-based anomaly detection has been studied extensively as a way to identify system malfunctions in a timely manner. However, existing log-based anomaly detection solutions share common limitations in representing log entries accurately and robustly, and hence cannot effectively handle log entries that were not seen in the historical logs, a common real-world scenario due to logs' inherent rarity and the continuous evolution of the systems. To address these issues, we propose Drill, a new log pre-processing method that generates high-quality vector representations of runtime logs by leveraging both storage-system-specific sentiment-classifying language models and log contexts built from the source code. Through extensive evaluations on two representative distributed storage systems (Apache HDFS and Lustre), we show that Drill achieves up to 41% improvement over state-of-the-art anomaly detection solutions, making it a promising solution for general anomaly detection.