Multi-failure detection using device hierarchical attention network

Bibliographic Details
Published in: Expert Systems with Applications, Vol. 203, p. 117277
Main Authors: An, Sangjun; Kim, Mintae; Kim, Wooju
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.10.2022

Summary: With rapid developments in the information industry, data centers have become increasingly important for collecting and storing data. The devices in data centers are not only connected to external machines to provide a variety of services but also store vast amounts of data, so device failures in data centers can result in severe, even fatal, economic damage. Various methods have been studied in recent years to predict failures in connected devices effectively. However, in data center-scale systems, failures of any individual device occur infrequently, which makes per-device failure prediction difficult. In addition, complex failures may occur within the data center owing to the mix of devices and systems, and in such cases it is difficult to determine the cause of failure. In this study, we present a device hierarchical attention network (DHAN) methodology that predicts failures for all devices by simultaneously using the available information on the devices in the data center. Because the devices in a data center can potentially affect each other, this device information is used in a composite manner. We observed that this composite approach predicts failures more effectively than failure prediction that uses information from a single device. In addition, by extracting attention information from the DHAN model, we identified the devices that play an important role in predicting the failure of a particular device. We then used this information to cluster the devices and reconstruct the DHAN model, which predicted failures more effectively. Based on the results presented herein, it is expected that systems can be stably maintained and repaired by identifying the potential impact of the devices.
• A network model is proposed to predict multi-device failure in a data center.
• A helpful and relevant device subset can be obtained to predict failure.
• Our model has excellent performance due to use of relevant device information.
ISSN: 0957-4174, 1873-6793
DOI: 10.1016/j.eswa.2022.117277