Learning Harmonized Representations for Speculative Sampling
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 28.08.2024 |
Summary: Speculative sampling is a promising approach to accelerate the decoding stage for Large Language Models (LLMs). Recent advancements that leverage the target LLM's contextual information, such as hidden states and the KV cache, have shown significant practical improvements. However, these approaches suffer from inconsistent context between training and decoding. We also observe another discrepancy between the training and decoding objectives in existing speculative sampling methods. In this work, we propose a solution named HArmonized Speculative Sampling (HASS) that learns harmonized representations to address these issues. HASS accelerates the decoding stage without adding inference overhead through harmonized objective distillation and harmonized context alignment. Experiments on four LLaMA models demonstrate that HASS achieves a 2.81x-4.05x wall-clock speedup averaged across three datasets, surpassing EAGLE-2 by 8%-20%.
DOI: 10.48550/arxiv.2408.15766
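For readers who want a concrete picture of the decoding loop the summary refers to, below is a minimal sketch of generic draft-then-verify speculative sampling, not the paper's HASS method itself. The model stubs, toy `VOCAB` size, and function names are illustrative assumptions: a cheap draft model proposes `k` tokens, and the target LLM verifies them, accepting each with probability min(1, p_target/p_draft) and resampling from the residual distribution on the first rejection.

```python
import numpy as np

# Illustrative sketch only: generic speculative sampling, not HASS itself.
rng = np.random.default_rng(0)
VOCAB = 8  # toy vocabulary size (assumption for the demo)

def draft_probs(context):
    """Stand-in for the cheap draft model's next-token distribution."""
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def target_probs(context):
    """Stand-in for the target LLM's next-token distribution."""
    logits = rng.normal(size=VOCAB)
    e = np.exp(logits - logits.max())
    return e / e.sum()

def speculative_step(context, k=4):
    """Draft k tokens, then verify them against the target model."""
    drafts, draft_dists = [], []
    ctx = list(context)
    for _ in range(k):  # cheap sequential drafting
        q = draft_probs(ctx)
        token = rng.choice(VOCAB, p=q)
        drafts.append(token)
        draft_dists.append(q)
        ctx.append(token)

    accepted = []
    ctx = list(context)
    for token, q in zip(drafts, draft_dists):
        p = target_probs(ctx)  # in practice computed in one parallel pass
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(token)  # draft agrees well enough with target
            ctx.append(token)
        else:
            # First rejection: resample from the residual (p - q)+ distribution
            # and discard all remaining draft tokens.
            residual = np.maximum(p - q, 0.0)
            accepted.append(rng.choice(VOCAB, p=residual / residual.sum()))
            break
    else:
        # Every draft token was accepted: take one bonus token from the target.
        accepted.append(rng.choice(VOCAB, p=target_probs(ctx)))
    return accepted

print(speculative_step(context=[1, 2, 3]))
```

The residual resampling step is what makes the procedure exact: each emitted token is distributed as if sampled directly from the target model, so the speedup comes purely from accepting several draft tokens per target forward pass.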