Horus: Interference-Aware and Prediction-Based Scheduling in Deep Learning Systems

To accelerate the training of Deep Learning (DL) models, clusters of machines equipped with hardware accelerators such as GPUs are leveraged to reduce execution time. State-of-the-art resource managers are needed to increase GPU utilization and maximize throughput. While co-locating DL jobs on the s...

Bibliographic Details
Published in IEEE transactions on parallel and distributed systems Vol. 33; no. 1; pp. 88 - 100
Main Authors Yeung, Gingfung; Borowiec, Damian; Yang, Renyu; Friday, Adrian; Harper, Richard; Garraghan, Peter
Format Journal Article
Language English
Published New York IEEE 01.01.2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)