Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the s...

Bibliographic Details
Published in	IEEE Transactions on Parallel and Distributed Systems, Vol. 33, no. 1, pp. 88-100
Main Authors	Yeung, Gingfung; Borowiec, Damian; Yang, Renyu; Friday, Adrian; Harper, Richard; Garraghan, Peter
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2022
Subjects	Accelerators; Cloud computing; Computational modeling; Deep learning; Distributed systems; Experimentation; GPU utilization; Graphics processing units; Hardware; Interference; Kernel; Load modeling; Predictive models; Production; Reduction; Resource utilization; Workload prediction
