Understanding Multi-Dimensional Efficiency of Fine-Tuning Large Language Models Using SpeedUp, MemoryUp, and EnergyUp
Published in: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 929-937
Main Authors: , , ,
Format: Conference Proceeding
Language: English
Published: IEEE, 27.05.2024
Summary: Training large language models (LLMs) from scratch is extremely time-consuming and computationally expensive. Fine-tuning provides an effective approach that skips the initial stages of training and focuses on adapting LLMs to downstream applications with lower resource demands and reduced cost. A variety of optimizations have been proposed to further optimize the fine-tuning process to reduce training time, memory usage, or energy consumption. This introduces a new challenge of selecting the appropriate optimizations that best fit different optimization goals and priorities. This paper presents a detailed analysis of the runtime, memory, and energy efficiency of five optimizations during the three phases of fine-tuning the facebook/opt-350m model, utilizing the DeepSpeed-Chat framework. We propose a framework, comprising SpeedUp, MemoryUp, and EnergyUp metrics, that can quantitatively evaluate different optimizations across multiple efficiency dimensions. Our research demonstrates that different optimizations have varying impacts on the three efficiency dimensions considered. There is no universally applicable solution to enhance all dimensions simultaneously. Among the five optimizations investigated, ZeRO1 consistently exhibits commendable SpeedUp across all three fine-tuning steps, regardless of hardware resources. Gradient checkpointing excels in MemoryUp during the initial two fine-tuning steps but has the lowest EnergyUp. In addition, current fine-tuning optimizations are not effective in reducing energy consumption. This study provides valuable insights into the selection of appropriate optimizations for runtime, memory usage, and energy consumption during the fine-tuning process of large language models.
DOI: 10.1109/IPDPSW63119.2024.00162
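The abstract proposes SpeedUp, MemoryUp, and EnergyUp as metrics for scoring an optimization against an unoptimized baseline along three efficiency dimensions. The paper's exact formulas are not reproduced in this record; the sketch below assumes the conventional ratio definitions (baseline measurement divided by optimized measurement, so a value above 1 means the optimization improves that dimension), and all field and function names are illustrative, not the authors':

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Measurements from one fine-tuning run (hypothetical field names)."""
    wall_time_s: float   # total runtime, seconds
    peak_mem_gb: float   # peak GPU memory, gigabytes
    energy_j: float      # total energy consumed, joules


def speedup(baseline: RunStats, optimized: RunStats) -> float:
    # > 1 means the optimization runs faster than the baseline
    return baseline.wall_time_s / optimized.wall_time_s


def memoryup(baseline: RunStats, optimized: RunStats) -> float:
    # > 1 means the optimization uses less peak memory
    return baseline.peak_mem_gb / optimized.peak_mem_gb


def energyup(baseline: RunStats, optimized: RunStats) -> float:
    # > 1 means the optimization consumes less energy
    return baseline.energy_j / optimized.energy_j


# Illustrative (made-up) numbers echoing the abstract's finding that
# gradient checkpointing saves memory but costs time and energy:
baseline = RunStats(wall_time_s=3600.0, peak_mem_gb=40.0, energy_j=9.0e6)
grad_ckpt = RunStats(wall_time_s=4200.0, peak_mem_gb=24.0, energy_j=1.1e7)

print(f"SpeedUp:  {speedup(baseline, grad_ckpt):.2f}")   # below 1: slower
print(f"MemoryUp: {memoryup(baseline, grad_ckpt):.2f}")  # above 1: less memory
print(f"EnergyUp: {energyup(baseline, grad_ckpt):.2f}")  # below 1: more energy
```

With ratio metrics like these, each optimization can be compared per dimension and per fine-tuning phase, which matches the abstract's observation that no single optimization wins on all three dimensions at once.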