A Survey of End-to-End Modeling for Distributed DNN Training: Workloads, Simulators, and TCO
Format: Journal Article
Language: English
Published: 10.06.2025
DOI: 10.48550/arxiv.2506.09275
Summary: Distributed deep neural networks (DNNs) have become a cornerstone for scaling machine learning to meet the demands of increasingly complex applications. However, the rapid growth in model complexity far outpaces CMOS technology scaling, making sustainable and efficient system design a critical challenge. Addressing this requires coordinated co-design across software, hardware, and technology layers. Due to the prohibitive cost and complexity of deploying full-scale training systems, simulators play a pivotal role in enabling this design exploration. This survey reviews the landscape of distributed DNN training simulators, focusing on three major dimensions: workload representation, simulation infrastructure, and models for total cost of ownership (TCO), including carbon emissions. It covers how workloads are abstracted and used in simulation, outlines common workload representation methods, and includes comprehensive comparison tables covering both simulation frameworks and TCO/emissions models, detailing their capabilities, assumptions, and areas of focus. In addition to synthesizing existing tools, the survey highlights emerging trends, common limitations, and open research challenges across the stack. By providing a structured overview, this work supports informed decision-making in the design and evaluation of distributed training systems.
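To illustrate the kind of estimate the TCO/emissions models surveyed here produce, the sketch below combines amortized hardware cost, facility energy cost scaled by PUE, and grid carbon intensity into a per-run figure. All parameter names and values are illustrative assumptions, not taken from any specific model in the survey.

```python
def training_run_tco(capex_usd, lifetime_years, run_years,
                     avg_power_kw, pue, energy_usd_per_kwh,
                     grid_kgco2_per_kwh):
    """Estimate (cost_usd, emissions_kgco2) for one training run.

    A minimal sketch: capital cost is amortized linearly over the
    hardware's service lifetime, and facility overhead (cooling,
    power delivery) is folded in via the PUE multiplier.
    """
    hours = run_years * 365 * 24
    # Fraction of the hardware's lifetime consumed by this run.
    amortized_capex = capex_usd * (run_years / lifetime_years)
    # Total facility energy drawn, including overhead via PUE.
    energy_kwh = avg_power_kw * pue * hours
    opex = energy_kwh * energy_usd_per_kwh
    emissions = energy_kwh * grid_kgco2_per_kwh
    return amortized_capex + opex, emissions

# Example: a $1M cluster (5-year life) drawing 100 kW at PUE 1.2,
# running for half a year at $0.10/kWh and 0.4 kgCO2/kWh.
cost, co2 = training_run_tco(1_000_000, 5, 0.5, 100, 1.2, 0.10, 0.4)
```

Real models in the survey refine each term (e.g. embodied carbon of the hardware, time-varying grid intensity, utilization-dependent power), but they share this basic amortized-capex-plus-energy structure.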