GPU Shared Scheduling System Under Deep Learning Container Cloud Platform


Bibliographic Details
Published in: Ji suan ji ke xue, Vol. 50, no. 6, p. 86
Main Authors: Wang, Zhuang; Wang, Pinghui; Wang, Bincheng; Wu, Wenbo; Wang, Bin; Cong, Pengyu
Format: Journal Article
Language: Chinese
Published: Chongqing: Guojia Kexue Jishu Bu, 01.01.2023

Summary: In recent years, containers have gradually replaced virtual machines in deep learning cloud platforms owing to their light weight and high scalability. However, these platforms still fall short in GPU resource management: because of limitations in container orchestration technology, multiple containers cannot share GPU resources. For small-scale model training tasks and model inference tasks, a single task cannot fully utilize the computing resources of an entire GPU card, so the current exclusive-access mode wastes expensive GPU resources and reduces both resource efficiency and service availability. To address this problem, this paper proposes a GPU sharing scheduling system. On the one hand, it extends existing cluster functions through the Kubernetes Operator mechanism, enabling multiple Pods to share GPU resources, and it designs an agent mechanism to ensure compatibility with native Kubernetes. On the other hand, based on the GPU
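Operator-based GPU sharing of the kind the abstract describes is typically surfaced to users as an extended resource that Pods request in fractions of a card (for example, in GiB of GPU memory) instead of whole devices. As a minimal illustrative sketch only, assuming a hypothetical extended resource name `example.com/gpu-mem` registered by the system's device plugin (the paper's actual resource name and units are not given in this record), such a Pod spec might look like:

```yaml
# Hypothetical Pod manifest: two such Pods could be scheduled onto the
# same physical GPU, since each requests only part of its memory.
apiVersion: v1
kind: Pod
metadata:
  name: inference-worker
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest
      resources:
        limits:
          # Extended resource name is an assumption for illustration;
          # a sharing Operator would advertise its own name and units.
          example.com/gpu-mem: 4   # e.g. 4 GiB of one GPU's memory
```

Because Kubernetes schedules extended resources opaquely, the native scheduler can place such Pods without modification, which is consistent with the compatibility goal the abstract attributes to the agent mechanism.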
ISSN: 1002-137X