An improved ant colony optimization algorithm with fault tolerance for job scheduling in grid computing systems

Bibliographic Details
Published in	PLoS ONE, Vol. 12, no. 5, p. e0177567
Main Authors	Idris, Hajara; Ezugwu, Absalom E; Junaidu, Sahalu B; Adewumi, Aderemi O
Format	Journal Article
Language	English
Published	United States: Public Library of Science (PLoS), 17.05.2017
Subjects	Ad hoc networks; Algorithms; Analysis; Animals; Ant colony optimization; Ants - physiology; Avoidance; Biology and Life Sciences; Business machines; Checkpointing; Cloud computing; Clustering; Colleges & universities; Colonies; Complexity; Computational grids; Computer and Information Sciences; Computer applications; Computer engineering; Computer networks; Computer science; Computer simulation; Computers; Computing time; Concurrency; Crashes; Distributed processing; Electrical engineering; Engineering; Engineering and Technology; Environment; Environments; Failure; Fault tolerance; Gene mapping; Genetic algorithms; Geographical distribution; Grid computing; Immunological tolerance; Institutions; Integration; International conferences; Job hunting; Mathematical Computing; Mathematics; Migration; Mobile communication systems; Models, Theoretical; Networking; Optimization; Physical Sciences; Recovery; Reliability; Research and Analysis Methods; Resource management; Scheduling; Servers; Simulation; Social Sciences; Statistics; Wireless networks

Summary:	The Grid scheduler schedules user jobs on the best available resource, in terms of resource characteristics, by optimizing job execution time. Resource failure in a Grid is no longer an exception but a regularly occurring event, as resources are increasingly used by the scientific community to solve computationally intensive problems that typically run for days or even months. It is therefore essential that these long-running applications can tolerate failures and avoid recomputation from scratch after a resource failure, in order to satisfy the user's Quality of Service (QoS) requirement. Job Scheduling with Fault Tolerance in Grid Computing using Ant Colony Optimization is proposed to ensure that jobs are executed successfully even when a resource failure has occurred. The technique employed in this paper is the use of the resource failure rate together with a checkpoint-based rollback recovery strategy. Checkpointing aims to reduce the amount of work lost upon failure of the system by immediately saving the state of the system. The proposed approach is compared with an existing Ant Colony Optimization (ACO) algorithm. The experimental results of the implemented fault-tolerant scheduling algorithm show an improvement in meeting the user's QoS requirement over the existing ACO algorithm, which has no fault tolerance integrated into it. The performance of the two algorithms was evaluated in terms of the three main scheduling performance metrics: makespan, throughput and average turnaround time.
Bibliography:	Conceptualization: HI AEE. Data curation: HI AEE AOA SBJ. Formal analysis: HI AEE AOA SBJ. Investigation: HI AEE AOA SBJ. Methodology: HI SBJ. Project administration: AOA SBJ. Resources: HI AEE AOA SBJ. Software: HI AEE AOE. Supervision: AOA SBJ. Validation: HI AEE AOA SBJ. Visualization: HI AEE. Writing – original draft: HI AEE AOA SBJ. Writing – review & editing: HI AEE AOA SBJ. Competing Interests: The authors have declared that no competing interests exist.
ISSN:	1932-6203
DOI:	10.1371/journal.pone.0177567
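
Note on the summary: the three metrics it names, and the effect of checkpointing on lost work, can be illustrated with a minimal Python sketch. This is not the authors' implementation. The `Job` record, the function names, and the fixed-interval checkpoint model are assumptions made here for illustration, and the article may define the metrics slightly differently.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record; field names are assumptions, not from the article.
    submit_time: float  # when the user submitted the job
    finish_time: float  # when the job completed successfully


def makespan(jobs):
    """Time from the earliest submission to the latest completion."""
    return max(j.finish_time for j in jobs) - min(j.submit_time for j in jobs)


def throughput(jobs):
    """Completed jobs per unit of time, measured over the makespan."""
    return len(jobs) / makespan(jobs)


def average_turnaround(jobs):
    """Mean time a job spends between submission and completion."""
    return sum(j.finish_time - j.submit_time for j in jobs) / len(jobs)


def lost_work(failure_time, checkpoint_interval=None):
    """Work that must be redone after a failure at `failure_time`.

    Without checkpointing, all work since the job started is lost.
    With checkpoints taken every `checkpoint_interval` time units
    (an assumed, simplified model), only the work done since the
    most recent checkpoint is lost.
    """
    if checkpoint_interval is None:
        return failure_time
    return failure_time % checkpoint_interval
```

For example, under these assumptions a job that fails 47 time units into its run loses all 47 units without checkpointing, but only 7 units if a checkpoint was taken every 10 units.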