Distributed Recovery for Enterprise Services

Bibliographic Details
Published in	2015 IEEE 9th International Conference on Self-Adaptive and Self-Organizing Systems, pp. 111-120
Main Authors	Clark, Shane S.; Beal, Jacob; Pal, Partha
Format	Conference Proceeding
Language	English
Published	IEEE 01.09.2015
Subjects	aggregate programming; distributed algorithms; Electronic mail; enterprise systems; Logic gates; Monitoring; protelis; Reliability; Servers; Sockets

More Information
Summary:	Small- to medium-scale enterprise systems are typically complex and highly specialized, but lack the management resources that can be devoted to large-scale (e.g., Cloud) systems, making them extremely challenging to manage. Here we present an adaptive algorithm for addressing a common management problem in enterprise service networks: safely and rapidly recovering from the failure of one or more services. Due to poorly documented and shifting dependencies, a typical industry practice for this situation is to bring the entire system down, then to restart services one at a time in a predefined order. We improve on this practice with the Dependency-Directed Recovery (DDR) algorithm, which senses dependencies by observing network interactions and recovers near-optimally from failures following a distributed graph algorithm. Our Java-based implementation of this system is suitable for deployment with a wide variety of networked enterprise services, and we validate its correct operation and advantage over fixed-order restart with emulation experiments on networks of up to 20 services.
ISSN:	1949-3673
DOI:	10.1109/SASO.2015.19