Automated IT system failure prediction: A deep learning approach

Bibliographic Details
Published in: 2016 IEEE International Conference on Big Data (Big Data), pp. 1291-1300
Main Authors: Ke Zhang; Jianwu Xu; Martin Renqiang Min; Guofei Jiang; Konstantinos Pelechrinis; Hui Zhang
Format: Conference Proceeding
Language: English
Published: IEEE, 01.12.2016
Summary: In mission-critical IT services, system failure prediction is becoming increasingly important; it prevents unexpected system downtime and assures service reliability for end users. While operational console logs record rich and descriptive information on the health status of those IT systems, existing system management technologies mostly use them in a labor-intensive forensics approach, i.e., identifying what went wrong after the fact. Recent efforts on log-based system management take an automation approach with text mining techniques, such as term frequency-inverse document frequency (TF-IDF). However, those techniques lead to a high-dimensional feature space and are not easily generalizable to heterogeneous log formats. In this paper, we present a novel system that automatically parses streamed console logs and detects early warning signals for IT system failure prediction. In particular, our solution includes a log pattern extraction method that clusters together logs with similar format and content. We then adapt the TF-IDF idea by treating each pattern as a word and the set of patterns in each discretized epoch as a document. This leads to a feature space with significantly lower dimensionality that can provide robust signals for the status of the system. As system failures tend to occur very rarely, we apply a recurrent neural network, namely Long Short-Term Memory (LSTM), to deal with the "rarity" of labeled data in the training process. LSTM is able to capture long-range dependencies across sequences and therefore outperforms traditional supervised learning methods in our application domain. We evaluated our proposed technology and compared it with state-of-the-art machine learning approaches using real log traces from two large enterprise systems. The results showed the advantages and potential of our system in predicting complex IT failures. To our knowledge, our work is the first that employs LSTM for log-based system failure prediction.
DOI: 10.1109/BigData.2016.7840733
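
The summary's feature construction (each log pattern treated as a word, each discretized epoch treated as a document) can be illustrated with a minimal sketch. This is not the authors' implementation: the pattern IDs (`P1`, `P2`, ...), the function name `pattern_tfidf`, and the specific weighting (relative term frequency times the natural log of inverse epoch frequency) are assumptions chosen for illustration, and the record does not say which TF-IDF variant the paper uses.

```python
import math
from collections import Counter


def pattern_tfidf(epochs):
    """Compute TF-IDF vectors with patterns as words and epochs as documents.

    epochs: list of lists of pattern IDs, one inner list per time epoch
            (hypothetical input format, assumed for this sketch).
    Returns (vocab, vectors), where vectors[i][j] is the weight of
    pattern vocab[j] in epoch i.
    """
    vocab = sorted({p for epoch in epochs for p in epoch})
    n_epochs = len(epochs)
    # Document frequency: number of epochs in which each pattern occurs.
    df = Counter(p for epoch in epochs for p in set(epoch))

    vectors = []
    for epoch in epochs:
        counts = Counter(epoch)
        total = len(epoch)
        vec = []
        for p in vocab:
            # Relative frequency of the pattern within this epoch.
            tf = counts[p] / total if total else 0.0
            # Patterns present in every epoch get idf = log(1) = 0.
            idf = math.log(n_epochs / df[p])
            vec.append(tf * idf)
        vectors.append(vec)
    return vocab, vectors


# Three epochs of already-extracted pattern IDs (made-up example data).
epochs = [
    ["P1", "P1", "P2"],
    ["P1", "P3"],
    ["P1", "P2", "P2", "P4"],
]
vocab, vectors = pattern_tfidf(epochs)
```

In this toy input, `P1` occurs in every epoch and so receives zero weight everywhere, while a pattern confined to a single epoch, such as `P3`, stands out. The vector length equals the number of distinct patterns rather than the number of distinct raw tokens, which is the dimensionality reduction the summary refers to. A sequence of such per-epoch vectors is the kind of input a sequence model like an LSTM could consume.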