Predictive Models of Hard Drive Failures Based on Operational Data

Bibliographic Details
Published in	2017 16th IEEE International Conference on Machine Learning and Applications (ICMLA) pp. 619 - 625
Main Authors	Aussel, Nicolas; Jaulin, Samuel; Gandon, Guillaume; Petetin, Yohan; Fazli, Eriza; Chabridon, Sophie
Format	Conference Proceeding
Language	English
Published	IEEE 01.12.2017
Subjects	Drives; Feature extraction; Hidden Markov models; Predictive models; Support vector machines; Training
Summary:	Hard drives are an essential component of modern data storage. In order to reduce the risk of data loss, hard drive failure prediction methods using the Self-Monitoring, Analysis and Reporting Technology (SMART) attributes have been proposed. However, these methods were developed from datasets that are not necessarily representative of operational systems. In this paper, we consider the Backblaze public dataset, a recent operational dataset from over 47,000 drives. It exhibits hard drive heterogeneity, with 81 models from 5 manufacturers; an extremely unbalanced ratio of 5000:1 between healthy and failure samples; and a real-world, loosely controlled environment. We observe that existing predictive models no longer perform sufficiently well on this dataset. We therefore selected machine learning classification methods able to deal with a very unbalanced training set, namely support vector machines (SVM), random forests (RF) and gradient boosted trees (GBT), and adapted them to the specific constraints of hard drive failure prediction. Our results reach over 95% precision and 67% recall on a one-year real-world public dataset of over 12 million records with only 2,586 failures.
DOI:	10.1109/ICMLA.2017.00-92
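
The summary reports precision and recall rather than accuracy. The sketch below illustrates why that choice matters at a 5000:1 class ratio. All counts are hypothetical and chosen only to mirror that ratio; they are not taken from the paper, and the function names are illustrative.

```python
def precision_recall(tp, fp, fn):
    """Return (precision, recall) from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def accuracy(tp, fp, fn, tn):
    """Fraction of all samples that were classified correctly."""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical sample counts with a 5000:1 healthy-to-failure ratio.
healthy, failing = 5_000_000, 1_000

# Trivial baseline: always predict "healthy". It never raises an alarm,
# so it has no true positives and misses every failure.
baseline_acc = accuracy(tp=0, fp=0, fn=failing, tn=healthy)
baseline_p, baseline_r = precision_recall(tp=0, fp=0, fn=failing)

# Hypothetical detector: flags 700 samples, 670 of which are real failures.
detector_p, detector_r = precision_recall(tp=670, fp=30, fn=330)

print(f"baseline accuracy={baseline_acc:.4%} recall={baseline_r:.2%}")
print(f"detector precision={detector_p:.2%} recall={detector_r:.2%}")
```

Under these assumed counts, the always-healthy baseline scores near-perfect accuracy while detecting no failures at all, which is why precision and recall on the failure class are the more informative figures for this kind of dataset.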