Dissecting the Runtime Performance of the Training, Fine-tuning, and Inference of Large Language Models


Bibliographic Details
Published in: arXiv.org
Main Authors: Zhang, Longteng; Liu, Xiang; Li, Zeyu; Pan, Xinglin; Dong, Peijie; Fan, Ruibo; Guo, Rui; Wang, Xin; Luo, Qiong; Shi, Shaohuai; Chu, Xiaowen
Format: Paper / Journal Article
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 01.12.2023

Summary: Large Language Models (LLMs) have seen great advances in both academia and industry, and their popularity has produced numerous open-source frameworks and techniques for accelerating LLM pre-training, fine-tuning, and inference. Training and deploying LLMs are expensive, as they require considerable computing resources and memory; hence, many efficient approaches have been developed for improving system pipelines as well as operators. However, the runtime performance can vary significantly across hardware and software stacks, which makes it difficult to choose the best configuration. In this work, we aim to benchmark the performance from both macro and micro perspectives. First, we benchmark the end-to-end performance of pre-training, fine-tuning, and serving LLMs of different sizes, i.e., 7, 13, and 70 billion parameters (7B, 13B, and 70B), on three 8-GPU platforms with and without individual optimization techniques, including ZeRO, quantization, recomputation, and FlashAttention. Then, we dive deeper to provide a detailed runtime analysis of the sub-modules, including computing and communication operators in LLMs. For end users, our benchmark and findings help them better understand different optimization techniques, training and inference frameworks, and hardware platforms when choosing configurations for deploying LLMs. For researchers, our in-depth module-wise analyses uncover potential opportunities for future work to further optimize the runtime performance of LLMs.
ISSN: 2331-8422
DOI: 10.48550/arxiv.2311.03687
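
The optimization techniques benchmarked in the paper (ZeRO, quantization, activation recomputation, and FlashAttention) are typically toggled through framework configuration rather than custom code. The sketch below is not taken from the paper; it is a minimal illustration, assuming the Hugging Face Transformers and DeepSpeed APIs in recent versions, of how such options might be switched on or off for a 7B-class model so their individual runtime impact can be compared. The model identifier, output path, and hyperparameters are placeholders.

```python
# Minimal sketch (not from the paper): toggling the benchmarked optimizations
# via Hugging Face Transformers + DeepSpeed. API names assume recent library
# versions; model ID and hyperparameters are placeholders.
import torch
from transformers import (
    AutoModelForCausalLM,
    BitsAndBytesConfig,
    TrainingArguments,
)

MODEL_ID = "meta-llama/Llama-2-7b-hf"  # placeholder 7B model

# Quantization: load weights in 4-bit (bitsandbytes) for memory-constrained runs.
quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

# FlashAttention: request the fused attention kernel if it is installed.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    quantization_config=quant_config,         # drop this line for full-precision runs
    attn_implementation="flash_attention_2",  # or "eager" to benchmark without it
    torch_dtype=torch.bfloat16,
)

# ZeRO: shard optimizer states, gradients, and parameters across the GPUs.
ds_config = {
    "zero_optimization": {"stage": 3},  # stages 1/2/3 trade memory for communication
    "bf16": {"enabled": True},
    "train_micro_batch_size_per_gpu": 1,
    "gradient_accumulation_steps": 8,
}

# Recomputation (gradient/activation checkpointing) is enabled per run.
training_args = TrainingArguments(
    output_dir="bench-out",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=8,
    gradient_checkpointing=True,  # False measures the cost of storing activations
    bf16=True,
    deepspeed=ds_config,          # pass None to benchmark without ZeRO
    logging_steps=10,
)
```

In practice, each optimization would be enabled or disabled in a separate run (e.g., ZeRO without 4-bit quantization), mirroring the paper's "with and without individual optimization techniques" setup, so that the measured throughput and memory differences can be attributed to a single knob.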