Impact of Failure Prediction on Availability: Modeling and Comparative Analysis of Predictive and Reactive Methods

Predicting failures and acting proactively have a potential to improve availability as a correct prediction and a successful mitigation may bring a reward resulting in decrease of downtime and availability improvement. But, conversely, each incorrect prediction may introduce additional downtime (pen...

Full description

Saved in:
Bibliographic Details
Published inIEEE transactions on dependable and secure computing Vol. 17; no. 3; pp. 493 - 505
Main Authors Kaitovic, Igor, Malek, Miroslaw
Format Journal Article
LanguageEnglish
Published Washington IEEE 01.05.2020
IEEE Computer Society
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Predicting failures and acting proactively have a potential to improve availability as a correct prediction and a successful mitigation may bring a reward resulting in decrease of downtime and availability improvement. But, conversely, each incorrect prediction may introduce additional downtime (penalty). Therefore, depending on the quality of prediction and the system parameters, predictive fault-tolerance methods may improve or may degrade availability in comparison to the reactive ones. We first derive taxonomies of fault-tolerant techniques and policies to differentiate between reactive and proactive policies that are further classified as systematic and predictive. To evaluate whether a predictive policy improves availability or not, we derive an analytical model for availability quantification. We use Markov chains to extend steady-state availability equation to include: precision and recall, penalty and reward, mitigation success probability and potential failure rate increase due to the prediction load. We also derive an A-measure to optimize failure prediction for increasing availability. In our conclusion, precision and recall have comparable impact on availability as changing MTTF and MTTR. To validate the model we also simulate and analyze availability of a virtualized server with exponential distribution of failure and repair rates.
ISSN:1545-5971
1941-0018
DOI:10.1109/TDSC.2018.2806448