Topology-aware GPU job scheduling with deep reinforcement learning and heuristics
Published in | Journal of Parallel and Distributed Computing, Vol. 204; p. 105138 |
---|---|
Format | Journal Article |
Language | English |
Published | Elsevier Inc, 01.10.2025 |
Summary: Deep neural networks (DNNs) have gained popularity in many fields such as computer vision and natural language processing. However, the increasing size of datasets and complexity of models have made training DNNs time-consuming. While distributed DNN training using multiple GPUs in parallel is a common solution, it introduces challenges in GPU resource management and scheduling. One key challenge is minimizing communication costs among the GPUs assigned to a DNN training job. High communication costs, arising from factors such as inter-rack or inter-machine data transfers, can lead to hardware bottlenecks and network delays, ultimately slowing down training. Reducing these costs enables more efficient data transfer and synchronization, directly accelerating the training process. Although deep reinforcement learning (DRL) has shown promise in GPU resource scheduling, existing methods often do not account for hardware topology. Moreover, most proposed GPU schedulers ignore the possibility of combining heuristic and DRL policies. In response to these challenges, we introduce TopDRL, a hybrid scheduler that integrates DRL and heuristic methods to enhance GPU job scheduling. TopDRL uses a multi-branch convolutional neural network (CNN) model for job selection and a heuristic method for GPU allocation. At each time step, the CNN model selects a job, and a heuristic then allocates the available GPUs that are closest to one another in the cluster. The CNN model is trained with reinforcement learning to select the job that maximizes throughput-based rewards. Extensive evaluation on datasets of real jobs shows that TopDRL significantly outperforms six baseline schedulers that use heuristics or other DRL models for job selection and resource allocation.
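The allocation step described above ("selects available GPUs closest to each other") suggests a nearest-set heuristic over the cluster topology. Below is a minimal, illustrative Python sketch of one such heuristic under an assumed two-level machine/rack distance model; the distance weights, data layout, and function names are assumptions for illustration, not the paper's implementation.

```python
from itertools import combinations

# Hypothetical topology distance: two GPUs on the same machine are
# closest, same rack is farther, different racks farthest. The weights
# (1, 2, 4) and the dict layout are illustrative assumptions.
def topo_distance(g1, g2):
    if g1["machine"] == g2["machine"]:
        return 1  # e.g., NVLink/PCIe within a machine
    if g1["rack"] == g2["rack"]:
        return 2  # intra-rack network hop
    return 4      # inter-rack network hop

def allocate_closest_gpus(free_gpus, k):
    """Return k free GPUs minimizing total pairwise topology distance.
    Brute force for clarity; a production scheduler would prune."""
    best, best_cost = None, float("inf")
    for combo in combinations(free_gpus, k):
        cost = sum(topo_distance(a, b) for a, b in combinations(combo, 2))
        if cost < best_cost:
            best, best_cost = combo, cost
    return list(best) if best is not None else []

# Toy cluster: 2 racks x 2 machines per rack x 2 GPUs per machine.
gpus = [{"id": i, "machine": i // 2, "rack": i // 4} for i in range(8)]
print([g["id"] for g in allocate_closest_gpus(gpus, 2)])  # -> [0, 1]
```

In a loop of this kind, the DRL-selected job would be handed a GPU set chosen this way at each scheduling step, keeping its collective communication on the cheapest links available.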
Highlights:
• TopDRL combines DRL with heuristics for scalable GPU cluster scheduling.
• It is topology-aware in both its reward design and its state representation.
• The reward maximizes throughput and reduces waiting time using topology (see the sketch after this list).
• TopDRL uses a novel input combining jobs, topology, and GPU availability.
• It outperforms Tetris by 47% in throughput and 42% in average JCT.
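For the reward highlight above, here is a minimal sketch of a throughput-based reward with a waiting-time penalty, assuming a simple linear combination; the weights alpha and beta, the per-job fields, and the placement-dependent comm_cost discount are illustrative assumptions, not the paper's formula.

```python
# A minimal sketch of a throughput-plus-waiting-time reward. The weights
# (alpha, beta) and the per-job fields (base_throughput, comm_cost,
# wait_time) are illustrative assumptions, not the paper's reward.
def reward(running_jobs, pending_jobs, alpha=1.0, beta=0.1):
    # Throughput term: each running job's throughput, discounted by the
    # communication cost of its topology-aware GPU placement.
    throughput = sum(j["base_throughput"] / j["comm_cost"] for j in running_jobs)
    # Waiting-time term: total queueing delay of jobs still pending.
    waiting = sum(j["wait_time"] for j in pending_jobs)
    return alpha * throughput - beta * waiting

# Example: one well-placed job, one spanning racks, one pending job
# that has waited 30 seconds.
running = [
    {"base_throughput": 100.0, "comm_cost": 1.0},
    {"base_throughput": 100.0, "comm_cost": 4.0},
]
pending = [{"wait_time": 30.0}]
print(reward(running, pending))  # 125.0 - 3.0 = 122.0
```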
ISSN: 0743-7315
DOI: 10.1016/j.jpdc.2025.105138