Predictive Models of Hard Drive Failures Based on Operational Data
Hard drives are an essential component of modern data storage. In order to reduce the risk of data loss, hard drive failure prediction methods using the Self-Monitoring, Analysis and Reporting Technology attributes have been proposed. However, these methods were developed from datasets not necessari...
Saved in:
Published in | 2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA) pp. 619 - 625 |
---|---|
Main Authors | , , , , , |
Format | Conference Proceeding |
Language | English |
Published |
IEEE
01.12.2017
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Hard drives are an essential component of modern data storage. In order to reduce the risk of data loss, hard drive failure prediction methods using the Self-Monitoring, Analysis and Reporting Technology attributes have been proposed. However, these methods were developed from datasets not necessarily representative of operational systems. In this paper, we consider the Backblaze public dataset, a recent operational dataset from over 47,000 drives, exhibiting hard drive heterogeneity with 81 models from 5 manufacturers, an extremely unbalanced ratio of 5000:1 between healthy and failure samples and a realworld loosely controlled environment. We observe that existing predictive models no longer perform sufficiently well on this dataset. We therefore selected machine learning classification methods able to deal with a very unbalanced training set, namely SVM, RF and GBT, and adapted them to the specific constraints of hard drive failure prediction. Our results reach over 95% precision and 67% recall on a one year real-world public dataset of over 12 million records with only 2586 failures. |
---|---|
DOI: | 10.1109/ICMLA.2017.00-92 |