Leftover: Improving Large Language Model Inference Efficiency by Leveraging Idle Resources
| Published in | 2023 International Conference on High Performance Big Data and Intelligent Systems (HDIS), pp. 60-65 |
|---|---|
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 06.12.2023 |
| DOI | 10.1109/HDIS60872.2023.10499636 |
Summary: Large language models and other deep learning models are deployed in many application areas that place heavy demands on computing resources but do not impose strict real-time response requirements. Recent algorithmic innovations have focused primarily on optimizing inference latency for large language models, without considering the throughput of inference tasks. Meanwhile, data centers often host many underutilized idle resources or offer cost-effective preemptible instances, which inference tasks can exploit to improve efficiency. This paper therefore introduces Leftover, a general-purpose large language model inference system that encompasses model compilation, deployment, and task scheduling infrastructure. Leftover leverages idle or preemptible resources to handle inference tasks that are insensitive to latency but require substantial computational power, leading to significant improvements in cluster computing performance. Evaluated on real-world workloads and simulated preemptive experiments, Leftover achieves up to an 11.28x increase in resource utilization compared to baseline methods and a 1.45x performance improvement over basic preemptive inference approaches.
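The record gives no implementation details, but the scheduling idea the summary describes can be illustrated with a minimal, hypothetical Python sketch. Every name here (LeftoverScheduler, PreemptibleWorker, InferenceTask) and the chunked-generation checkpointing are assumptions for illustration, not the paper's actual interfaces: latency-insensitive tasks queue up, run on whichever preemptible workers are currently idle, and are requeued with their saved progress if an instance is reclaimed mid-run.

```python
# Hypothetical sketch (not the paper's code): dispatch latency-insensitive
# inference tasks onto idle preemptible workers, requeueing on preemption.
import queue
from dataclasses import dataclass

@dataclass
class InferenceTask:
    task_id: int
    prompt: str
    progress: int = 0  # tokens generated so far; survives a preemption

@dataclass
class PreemptibleWorker:
    worker_id: int
    preempted: bool = False  # set True when the provider reclaims the instance

class LeftoverScheduler:
    """Greedy scheduler: run queued tasks on any idle preemptible worker;
    a preempted task keeps its progress and returns to the queue."""

    def __init__(self, workers):
        self.workers = workers
        self.pending = queue.Queue()

    def submit(self, task):
        self.pending.put(task)

    def step(self):
        # One scheduling round: hand a pending task to each live worker.
        for worker in self.workers:
            if worker.preempted or self.pending.empty():
                continue
            task = self.pending.get()
            if not self._run_on(worker, task):
                self.pending.put(task)  # instance reclaimed: requeue with progress

    def _run_on(self, worker, task, total_tokens=128, chunk=16):
        # Generate in small chunks so a preemption loses at most one chunk.
        while task.progress < total_tokens:
            if worker.preempted:
                return False
            task.progress += chunk  # stand-in for actual token generation
        return True

if __name__ == "__main__":
    sched = LeftoverScheduler([PreemptibleWorker(0), PreemptibleWorker(1)])
    sched.submit(InferenceTask(1, "summarize this document"))
    sched.step()
```

A real system would presumably persist generation state (e.g. the KV cache) off-instance so that requeued work resumes rather than restarts; the integer progress counter stands in for that here.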