Model training method and device and cluster system

Bibliographic Details
Main Authors: ZHANG HENGHUA, LI ZHI, DING RUIQUAN, LUO BAOTONG, HU ZAIBIN, HUANG KAIWEN
Format: Patent
Language: Chinese, English
Published: 23.06.2020
Summary: The embodiments of the present application disclose a model training method, a device, and a cluster system, and relate to the technical field of artificial intelligence. The specific implementation is as follows. On the hardware side, a control node and at least one computing node are interconnected through a network, and GPUs are introduced into the computing nodes as computing resources, which greatly improves the hardware capability of the cluster system and, in turn, the efficiency of model training. On the software side, the Slurm framework is optimized and a client, a super management platform, and similar components are introduced, making the cluster system more convenient to use.
Bibliography: Application Number: CN202010080825