GPU Shared Scheduling System Under Deep Learning Container Cloud Platform
| Published in | Ji suan ji ke xue Vol. 50; no. 6; p. 86 |
| --- | --- |
| Main Authors | |
| Format | Journal Article |
| Language | Chinese |
| Published | Chongqing: Guojia Kexue Jishu Bu, 01.01.2023 |
| Summary | In recent years, containers have gradually replaced virtual machines and are widely used in deep learning cloud platforms due to their light weight and high scalability. However, deep learning cloud platforms still have deficiencies in GPU resource management, chiefly that multiple containers cannot share GPU resources because of the limitations of container orchestration technology. For some small-scale model training tasks and model inference tasks, a single task cannot fully utilize the computing resources of an entire GPU card, so the current exclusive mode wastes expensive GPU resources and reduces resource efficiency and service availability. To address this problem, this paper proposes a GPU sharing scheduling system. On the one hand, it extends the existing cluster functions via the Kubernetes Operator mechanism, enabling multiple Pods to share GPU resources, and designs an agent mechanism to ensure compatibility with native Kubernetes. On the other hand, based on the GPU… (see the sketch below the record) |
| ISSN | 1002-137X |
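
To make the abstract's idea concrete, the following is a minimal sketch, assuming a shared-GPU extension of the kind it describes. It uses the official Python `kubernetes` client to build a Pod that requests a slice of GPU memory through an extended resource; the resource name `example.com/gpu-mem`, the image, and the sizes are hypothetical illustrations, not the paper's actual components.

```python
from kubernetes import client, config

def make_shared_gpu_pod(name: str, image: str, gpu_mem_gib: int) -> client.V1Pod:
    """Build a Pod spec that asks for a fraction of a GPU instead of a whole card."""
    container = client.V1Container(
        name=name,
        image=image,
        resources=client.V1ResourceRequirements(
            # With the default NVIDIA device plugin, "nvidia.com/gpu: 1" reserves an
            # entire card. A sharing extension typically advertises a finer-grained
            # extended resource (here a hypothetical per-GiB memory unit) that several
            # Pods can draw from on the same physical GPU.
            limits={"example.com/gpu-mem": str(gpu_mem_gib)},
        ),
    )
    return client.V1Pod(
        metadata=client.V1ObjectMeta(name=name),
        spec=client.V1PodSpec(containers=[container], restart_policy="Never"),
    )

if __name__ == "__main__":
    config.load_kube_config()  # read the local kubeconfig
    pod = make_shared_gpu_pod("infer-task", "tensorflow/tensorflow:latest-gpu", 4)
    client.CoreV1Api().create_namespaced_pod(namespace="default", body=pod)
```

Because the request is an ordinary extended resource, such a Pod spec stays compatible with native Kubernetes scheduling; the sharing-specific placement and isolation would be handled by the cluster-side components (Operator and agent) that the abstract mentions but does not detail here.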