Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems
Format | Journal Article |
Language | English |
Published | 14.08.2024 |
Summary: | Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and significant computing time, leading to frequent failures that substantially increase training costs. Despite its significance, this field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called \emph{Training Overhead Ratio} (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice. |
DOI: | 10.48550/arxiv.2408.07482 |
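As a quick illustration of the metric described in the summary, the sketch below restates the TOR definition in LaTeX. The symbols T_opt and T_obs are illustrative placeholders, not the paper's own notation, and the bound on TOR simply follows from observed time being at least the optimal time.

```latex
% Minimal sketch of the Training Overhead Ratio (TOR) definition
% restated from the summary above. Symbol names are assumptions:
%   T_opt = optimal (failure-free) training time
%   T_obs = observed training time on the system under evaluation
\documentclass{article}
\begin{document}
\[
  \mathrm{TOR} \;=\; \frac{T_{\mathrm{opt}}}{T_{\mathrm{obs}}}
\]
Since $T_{\mathrm{obs}} \ge T_{\mathrm{opt}}$ in practice, $0 < \mathrm{TOR} \le 1$,
and values closer to $1$ indicate less failure-induced overhead.
\end{document}
```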