Understanding the Performance and Estimating the Cost of LLM Fine-Tuning
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 08.08.2024 |
Subjects | |
Online Access | Get full text |
Summary: | Due to the cost-prohibitive nature of training Large Language Models (LLMs), fine-tuning has emerged as an attractive alternative for specializing LLMs for specific tasks using limited compute resources in a cost-effective manner. In this paper, we characterize sparse Mixture of Experts (MoE) based LLM fine-tuning to understand its accuracy and runtime performance on a single GPU. Our evaluation provides unique insights into the training efficacy of sparse and dense versions of MoE models, as well as their runtime characteristics, including maximum batch size, execution time breakdown, end-to-end throughput, GPU hardware utilization, and load distribution. Our study identifies the optimization of the MoE layer as crucial for further improving the performance of LLM fine-tuning. Using our profiling results, we also develop and validate an analytical model to estimate the cost of LLM fine-tuning on the cloud. This model, based on the parameters of the model and the GPU architecture, estimates LLM throughput and the cost of training, helping practitioners in industry and academia budget the cost of fine-tuning a specific model. |
---|---|
DOI: | 10.48550/arxiv.2408.04693 |
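
The record does not reproduce the paper's analytical model. As a rough, hypothetical illustration of the kind of first-order estimate such a model produces, the sketch below combines the common 6·N FLOPs-per-token training approximation with an assumed GPU utilization and an assumed cloud GPU price. Every function name, parameter, and default value here is an assumption for illustration, not the paper's formulation.

```python
# Rough illustration (not the paper's model): estimate fine-tuning throughput
# and cloud cost from model size, GPU peak compute, and an assumed utilization.
# The 6 * N FLOPs-per-token rule of thumb (forward + backward) and all default
# values below are assumptions for this sketch, not results from the paper.

def estimate_finetuning_cost(
    active_params: float,           # parameters active per token (e.g. 7e9);
                                    # for a sparse MoE model, count only the
                                    # experts routed to each token
    tokens: float,                  # total number of fine-tuning tokens
    gpu_peak_flops: float,          # peak throughput of one GPU, in FLOP/s
    mfu: float = 0.35,              # assumed model FLOPs utilization (0-1)
    gpu_hourly_price: float = 2.0,  # assumed cloud price per GPU-hour (USD)
    num_gpus: int = 1,
) -> dict:
    flops_per_token = 6.0 * active_params                    # fwd + bwd approximation
    tokens_per_second = num_gpus * gpu_peak_flops * mfu / flops_per_token
    train_hours = tokens / tokens_per_second / 3600.0
    cost_usd = train_hours * num_gpus * gpu_hourly_price
    return {
        "tokens_per_second": tokens_per_second,
        "train_hours": train_hours,
        "cost_usd": cost_usd,
    }


if __name__ == "__main__":
    # Example with hypothetical numbers: ~7B active parameters, 1B tokens,
    # a single GPU with ~312 TFLOP/s peak compute.
    est = estimate_finetuning_cost(
        active_params=7e9,
        tokens=1e9,
        gpu_peak_flops=312e12,
    )
    print(f"{est['tokens_per_second']:.0f} tokens/s, "
          f"{est['train_hours']:.1f} GPU-hours, ~${est['cost_usd']:.0f}")
```

The sketch only captures the compute-bound case on a single GPU; the paper's model additionally accounts for measured throughput characteristics (e.g., batch size limits and MoE-layer behavior) obtained from profiling.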