Model training method and device and cluster system
Main Authors | , , , , , |
---|---|
Format | Patent |
Language | Chinese English |
Published | 23.06.2020 |
Subjects | |
Summary: | An embodiment of the application discloses a model training method, a model training device, and a cluster system, and relates to the technical field of artificial intelligence. The specific implementation is as follows. On the hardware side, a control node and at least one computing node are interconnected through a network, and GPUs are introduced into the computing nodes as computing resources, which substantially improves the hardware capability of the cluster system and in turn the efficiency of model training. On the software side, the slurm framework is optimized and a client, a super management platform, and the like are introduced, making the cluster system more convenient to use. |
---|---|
Bibliography: | Application Number: CN202010080825 |