Understanding Multi-Dimensional Efficiency of Fine-Tuning Large Language Models Using SpeedUp, MemoryUp, and EnergyUp

Bibliographic Details
Published in: 2024 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), pp. 929-937
Main Authors: Chen, Dayuan; Soto, Noe; Tuttle, Jonas F.; Zong, Ziliang
Format: Conference Proceeding
Language: English
Published: IEEE, 27.05.2024
Summary: Training large language models (LLMs) from scratch is extremely time-consuming and computationally expensive. Fine-tuning provides an effective alternative that skips the initial stages of training and focuses on adapting LLMs to downstream applications with lower resource demands and reduced cost. A variety of optimizations have been proposed to further improve the fine-tuning process by reducing training time, memory usage, or energy consumption. This introduces a new challenge: selecting the optimizations that best fit different optimization goals and priorities. This paper presents a detailed analysis of the runtime, memory, and energy efficiency of five optimizations during the three phases of fine-tuning the facebook/opt-350m model using the DeepSpeed-Chat framework. We propose a framework, comprising SpeedUp, MemoryUp, and EnergyUp metrics, that quantitatively evaluates different optimizations across multiple efficiency dimensions. Our research demonstrates that different optimizations have varying impacts on the three efficiency dimensions considered; there is no universally applicable solution that enhances all dimensions simultaneously. Among the five optimizations investigated, ZeRO1 consistently exhibits commendable SpeedUp across all three fine-tuning steps, regardless of hardware resources. Gradient checkpointing excels in MemoryUp during the initial two fine-tuning steps but has the lowest EnergyUp. In addition, current fine-tuning optimizations are not effective in reducing energy consumption. This study provides valuable insights into selecting appropriate optimizations for runtime, memory usage, and energy consumption during the fine-tuning of large language models.
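The abstract names three metrics but does not reproduce their formulas. A minimal sketch of how such metrics are typically defined, assuming each "Up" is the ratio of the baseline (unoptimized) measurement to the optimized one, so values above 1.0 indicate an improvement along that dimension; the field names and illustrative numbers below are hypothetical, not taken from the paper:

```python
# Hypothetical sketch of SpeedUp, MemoryUp, and EnergyUp as
# baseline/optimized ratios (an assumption; the paper's exact
# definitions are not given in this record).

def efficiency_ups(baseline, optimized):
    """Return (SpeedUp, MemoryUp, EnergyUp) for one fine-tuning step.

    `baseline` and `optimized` are dicts with keys:
      'time_s'   -- wall-clock runtime in seconds
      'mem_gb'   -- peak GPU memory in gigabytes
      'energy_j' -- energy consumed in joules
    """
    return (
        baseline['time_s'] / optimized['time_s'],
        baseline['mem_gb'] / optimized['mem_gb'],
        baseline['energy_j'] / optimized['energy_j'],
    )

# Illustrative profile resembling gradient checkpointing:
# slightly slower, much less memory, more energy consumed.
base = {'time_s': 1000.0, 'mem_gb': 24.0, 'energy_j': 3.0e6}
opt = {'time_s': 1250.0, 'mem_gb': 12.0, 'energy_j': 3.6e6}
speedup, memoryup, energyup = efficiency_ups(base, opt)
```

With these made-up numbers, SpeedUp and EnergyUp fall below 1.0 while MemoryUp reaches 2.0, matching the abstract's observation that an optimization can help one dimension while hurting the others.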
DOI:10.1109/IPDPSW63119.2024.00162