Understanding Multi-Dimensional Efficiency of Fine-Tuning Large Language Models Using SpeedUp, MemoryUp, and EnergyUp

Bibliographic Details
Published in: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 929-937
Main Authors: Chen, Dayuan; Soto, Noe; Tuttle, Jonas F.; Zong, Ziliang
Format: Conference Proceeding
Language: English
Published: IEEE, 27.05.2024
Summary: Training large language models (LLMs) from scratch is extremely time-consuming and computationally expensive. Fine-tuning provides an effective alternative that skips the initial stages of training and focuses on adapting LLMs to downstream applications with lower resource demands and reduced cost. A variety of optimizations have been proposed to further improve the fine-tuning process by reducing training time, memory usage, or energy consumption. This introduces a new challenge: selecting the optimizations that best fit different optimization goals and priorities. This paper presents a detailed analysis of the runtime, memory, and energy efficiency of five optimizations during the three phases of fine-tuning the facebook/opt-350m model using the DeepSpeed-Chat framework. We propose a framework, comprising SpeedUp, MemoryUp, and EnergyUp metrics, that quantitatively evaluates different optimizations across multiple efficiency dimensions. Our research demonstrates that different optimizations have varying impacts on the three efficiency dimensions considered; there is no universally applicable solution that enhances all dimensions simultaneously. Among the five optimizations investigated, ZeRO1 consistently exhibits commendable SpeedUp across all three fine-tuning steps, regardless of hardware resources. Gradient checkpointing excels in MemoryUp during the initial two fine-tuning steps but has the lowest EnergyUp. In addition, current fine-tuning optimizations are not effective in reducing energy consumption. This study provides valuable insights into selecting appropriate optimizations for runtime, memory usage, and energy consumption during the fine-tuning of large language models.
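The abstract names three metrics but does not reproduce their formulas. A minimal sketch of how such metrics are typically defined, assuming each "Up" is the ratio of the baseline (unoptimized) measurement to the optimized one, so values above 1.0 indicate an improvement along that dimension; the field names and illustrative numbers below are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of SpeedUp, MemoryUp, and EnergyUp as
# baseline/optimized ratios (an assumption; the paper's exact
# definitions are not given in this record).

def efficiency_ups(baseline, optimized):
    """Return (SpeedUp, MemoryUp, EnergyUp) for one fine-tuning step.

    `baseline` and `optimized` are dicts with keys:
      'time_s'   -- wall-clock runtime in seconds
      'mem_gb'   -- peak GPU memory in gigabytes
      'energy_j' -- energy consumed in joules
    """
    return (
        baseline['time_s'] / optimized['time_s'],
        baseline['mem_gb'] / optimized['mem_gb'],
        baseline['energy_j'] / optimized['energy_j'],
    )

# Illustrative profile resembling gradient checkpointing:
# slightly slower, much less memory, more energy consumed.
base = {'time_s': 1000.0, 'mem_gb': 24.0, 'energy_j': 3.0e6}
opt = {'time_s': 1250.0, 'mem_gb': 12.0, 'energy_j': 3.6e6}
speedup, memoryup, energyup = efficiency_ups(base, opt)
```

With these made-up numbers, SpeedUp and EnergyUp fall below 1.0 while MemoryUp reaches 2.0, matching the abstract's observation that an optimization can help one dimension while hurting the others.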
DOI:10.1109/IPDPSW63119.2024.00162