GPU Shared Scheduling System Under Deep Learning Container Cloud Platform


Bibliographic Details
Published in: Ji suan ji ke xue, Vol. 50, no. 6, p. 86
Main Authors: Wang, Zhuang; Wang, Pinghui; Wang, Bincheng; Wu, Wenbo; Wang, Bin; Cong, Pengyu
Format: Journal Article
Language: Chinese
Published: Chongqing: Guojia Kexue Jishu Bu, 01.01.2023

Summary: In recent years, containers have gradually replaced virtual machines in deep learning cloud platforms owing to their light weight and high scalability. However, these platforms still fall short in GPU resource management: because of limitations in container orchestration technology, multiple containers cannot share GPU resources. For small-scale model training tasks and model inference tasks, a single task cannot fully utilize the computing resources of an entire GPU card, so the current exclusive-access mode wastes expensive GPU resources and reduces both resource efficiency and service availability. To address this problem, this paper proposes a GPU sharing scheduling system. On the one hand, it extends existing cluster functions through the Kubernetes Operator mechanism, enabling multiple Pods to share GPU resources, and it designs an agent mechanism to ensure compatibility with native Kubernetes. On the other hand, based on the GPU
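Operator-based GPU sharing of the kind the abstract describes is typically surfaced to users as an extended resource that Pods request in fractions of a card (for example, in GiB of GPU memory) instead of whole devices. As a minimal illustrative sketch only, assuming a hypothetical extended resource name `example.com/gpu-mem` registered by the system's device plugin (the paper's actual resource name and units are not given in this record), such a Pod spec might look like:

```yaml
# Hypothetical Pod manifest: two such Pods could be scheduled onto the
# same physical GPU, since each requests only part of its memory.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest
      resources:
        limits:
          # Extended resource name is an assumption for illustration;
          # a sharing Operator would advertise its own name and units.
          example.com/gpu-mem: 4   # e.g. 4 GiB of one GPU's memory
```

Because Kubernetes schedules extended resources opaquely, the native scheduler can place such Pods without modification, which is consistent with the compatibility goal the abstract attributes to the agent mechanism.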
ISSN: 1002-137X