Improving Batch Scheduling on Blue Gene/Q by Relaxing Network Allocation Constraints

Bibliographic Details
Published in: IEEE Transactions on Parallel and Distributed Systems, Vol. 27, no. 11, pp. 3269-3282
Main Authors: Zhou Zhou, Xu Yang, Zhiling Lan, Paul Rich, Wei Tang, Vitali Morozov, Narayan Desai
Format: Journal Article
Language: English
Published: New York, IEEE, 01.11.2016
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: As systems scale toward exascale, many resources will become increasingly constrained. While some of these resources have historically been explicitly allocated, many (such as network bandwidth, I/O bandwidth, or power) have not. As systems continue to evolve, we expect many such resources to become explicitly managed. This change will pose critical challenges to resource management and job scheduling. In this paper, we explore the potential of relaxing network allocation constraints for Blue Gene systems. Our objective is to improve the batch scheduling performance, where the partition-based interconnect architecture provides a unique opportunity to explicitly allocate network resources to jobs. This paper makes three major contributions. The first is substantial benchmarking of parallel applications, focusing on assessing application sensitivity to communication bandwidth at large scale. The second is three new scheduling schemes using relaxed network allocation and targeted at balancing individual job performance with overall system performance. The third is a comparative study of our scheduling schemes versus the existing scheduler on Mira, a 48-rack Blue Gene/Q system at Argonne National Laboratory. Specifically, we use job traces collected from this production system.
ISSN: 1045-9219, 1558-2183
DOI: 10.1109/TPDS.2016.2528247