AdapCC: Making Collective Communication in Distributed Machine Learning Adaptive

As deep learning (DL) models continue to grow in size, there is a pressing need for distributed model learning using a large number of devices (e.g., GPUs) and servers. Collective communication among devices/servers (for gradient synchronization, intermediate data exchange, etc.) introduces significant...
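The gradient synchronization mentioned in the abstract is typically realized with an all-reduce collective. The following minimal sketch (not from the paper) illustrates the idea using PyTorch's torch.distributed, assuming a process group has already been initialized:

```python
import torch
import torch.distributed as dist

def sync_gradients(model: torch.nn.Module) -> None:
    """Average gradients across all workers via an all-reduce collective.

    Assumes dist.init_process_group(...) has been called on every worker.
    """
    world_size = dist.get_world_size()
    for param in model.parameters():
        if param.grad is not None:
            # Sum this parameter's gradient across all workers,
            # then divide by the number of workers to get the mean.
            dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
            param.grad /= world_size
```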


Bibliographic Details
Published in: Proceedings of the International Conference on Distributed Computing Systems, pp. 25–35
Main Authors: Zhao, Xiaoyang; Zhang, Zhe; Wu, Chuan
Format: Conference Proceeding
Language: English
Published: IEEE, 23.07.2024
