Training Overhead Ratio: A Practical Reliability Metric for Large Language Model Training Systems

Bibliographic Details
Main Authors: Lu, Ning; Xie, Qian; Zhang, Hao; Fang, Wenyi; Zheng, Yang; Hu, Zheng; Ma, Jiantao
Format: Journal Article
Language: English
Published: 14.08.2024
Summary: Large Language Models (LLMs) are revolutionizing the AI industry with their superior capabilities. Training these models requires large-scale GPU clusters and substantial computing time, and the frequent failures encountered at this scale significantly increase training costs. Despite its importance, the field lacks a metric for evaluating reliability. In this work, we introduce a novel reliability metric called \emph{Training Overhead Ratio} (TOR) to evaluate the reliability of fault-tolerant LLM training systems. TOR is defined as the ratio of the optimal training time to the observed training time of a system, serving as a practical tool for users to estimate the actual time required to train an LLM on a given system. Furthermore, our investigation identifies the key factor for enhancing reliability and presents TOR equations for various types of failures encountered in practice.
DOI: 10.48550/arxiv.2408.07482
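
Illustration (a sketch based on the definition in the summary, not notation taken from the paper): the metric can be written as

\[
\mathrm{TOR} = \frac{T_{\text{opt}}}{T_{\text{obs}}}, \qquad 0 < \mathrm{TOR} \le 1,
\]

where $T_{\text{opt}}$ denotes the optimal (failure-free) training time and $T_{\text{obs}}$ the training time actually observed on the system; the symbol names are illustrative. Assuming the observed time is never shorter than the optimal time, values closer to 1 indicate less overhead from failures and recovery.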