Leftover: Improving Large Language Model Inference Efficiency by Leveraging Idle Resources
| Published in | 2023 International Conference on High Performance Big Data and Intelligent Systems (HDIS), pp. 60-65 |
|---|---|
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 06.12.2023 |
| DOI | 10.1109/HDIS60872.2023.10499636 |
Summary: Large language models and other deep learning models are deployed in many application areas that place heavy demands on computing resources but do not impose strict real-time response requirements. Recent algorithmic innovations have focused primarily on optimizing inference latency for large language models, without considering the throughput of inference tasks. Meanwhile, data centers often host many underutilized idle resources or offer cost-effective preemptible instances, which inference tasks can exploit to improve efficiency. This paper therefore introduces Leftover, a general-purpose large language model inference system that encompasses model compilation, deployment, and task scheduling infrastructure. Leftover leverages idle or preemptible resources to handle inference tasks that are insensitive to latency but require substantial computational power, leading to significant improvements in cluster computing performance. Evaluated on real-world workloads and simulated preemptive experiments, Leftover achieves up to an 11.28x increase in resource utilization compared to baseline methods and a 1.45x performance improvement over basic preemptive inference approaches.
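The record gives no implementation details, but the scheduling idea the summary describes can be illustrated with a minimal, hypothetical Python sketch. Every name here (LeftoverScheduler, PreemptibleWorker, InferenceTask) and the chunked-generation checkpointing are assumptions for illustration, not the paper's actual interfaces: latency-insensitive tasks queue up, run on whichever preemptible workers are currently idle, and are requeued with their saved progress if an instance is reclaimed mid-run.

```python
# Hypothetical sketch (not the paper's code): dispatch latency-insensitive
# inference tasks onto idle preemptible workers, requeueing on preemption.
import queue
from dataclasses import dataclass

@dataclass
class InferenceTask:
    task_id: int
    prompt: str
    progress: int = 0  # tokens generated so far; survives a preemption

@dataclass
class PreemptibleWorker:
    worker_id: int
    preempted: bool = False  # set True when the provider reclaims the instance

class LeftoverScheduler:
    """Greedy scheduler: run queued tasks on any idle preemptible worker;
    a preempted task keeps its progress and returns to the queue."""

    def __init__(self, workers):
        self.workers = workers
        self.pending = queue.Queue()

    def submit(self, task):
        self.pending.put(task)

    def step(self):
        # One scheduling round: hand a pending task to each live worker.
        for worker in self.workers:
            if worker.preempted or self.pending.empty():
                continue
            task = self.pending.get()
            if not self._run_on(worker, task):
                self.pending.put(task)  # instance reclaimed: requeue with progress

    def _run_on(self, worker, task, total_tokens=128, chunk=16):
        # Generate in small chunks so a preemption loses at most one chunk.
        while task.progress < total_tokens:
            if worker.preempted:
                return False
            task.progress += chunk  # stand-in for actual token generation
        return True

if __name__ == "__main__":
    sched = LeftoverScheduler([PreemptibleWorker(0), PreemptibleWorker(1)])
    sched.submit(InferenceTask(1, "summarize this document"))
    sched.step()
```

A real system would presumably persist generation state (e.g. the KV cache) off-instance so that requeued work resumes rather than restarts; the integer progress counter stands in for that here.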