QCQA: Quality and Capacity-aware grouped Query Attention


Bibliographic Details
Published in: arXiv.org
Main Authors: Joshi, Vinay; Laddha, Prashant; Sinha, Shambhavi; Om Ji Omer; Sreenivas Subramoney
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 08.06.2024

Summary: Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate these challenges by grouping query heads and consequently reducing the number of corresponding key and value heads. However, MQA and GQA decrease the KV-cache size requirements at the expense of LLM accuracy (quality of text generation). These methods do not ensure an optimal tradeoff between KV-cache size and text generation quality due to the absence of quality-aware grouping of query heads. To address this issue, we propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally efficient and inexpensive fitness function. We demonstrate that QCQA achieves a significantly better tradeoff between KV-cache capacity and LLM accuracy compared to GQA. For the Llama2 7B model, QCQA achieves 20% higher accuracy than GQA with similar KV-cache size requirements in the absence of fine-tuning. After fine-tuning both QCQA and GQA, for a similar KV-cache size, QCQA provides 10.55% higher accuracy than GQA. Furthermore, QCQA requires 40% less KV-cache size than GQA to attain similar accuracy. The proposed quality and capacity-aware grouping of query heads can serve as a new paradigm for KV-cache optimization in autoregressive LLM inference.
ISSN: 2331-8422
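
The abstract describes QCQA as an evolutionary search over query-head groupings guided by an inexpensive fitness function, but this record gives no implementation details. The sketch below is a toy illustration only, not the authors' method: the fitness combines a hypothetical quality proxy (within-group scatter of stand-in per-head feature vectors) with a penalty proportional to the number of KV groups, and every constant (NUM_HEADS, NUM_GROUPS, CACHE_WEIGHT, population size) is an assumed placeholder.

```python
# Toy evolutionary search over query-head groupings, loosely inspired by the
# QCQA abstract. NOT the paper's algorithm: the fitness below is a made-up
# proxy standing in for the "computationally efficient and inexpensive
# fitness function" mentioned in the summary.

import numpy as np

rng = np.random.default_rng(0)

NUM_HEADS = 32        # query heads in one attention layer (e.g. Llama2 7B)
NUM_GROUPS = 8        # maximum number of shared KV-head groups
POP_SIZE = 40         # candidate groupings per generation
GENERATIONS = 200
CACHE_WEIGHT = 0.05   # assumed weight trading quality proxy vs. KV-cache size

# Stand-in per-head feature vectors; in practice these would be derived from
# the pretrained key/value projection weights of the model.
head_features = rng.normal(size=(NUM_HEADS, 64))


def fitness(grouping: np.ndarray) -> float:
    """Lower is better: within-group scatter (quality proxy) + cache penalty."""
    scatter = 0.0
    for g in np.unique(grouping):
        members = head_features[grouping == g]
        scatter += np.sum((members - members.mean(axis=0)) ** 2)
    # KV-cache size grows with the number of distinct KV groups actually used.
    cache_penalty = CACHE_WEIGHT * len(np.unique(grouping))
    return scatter + cache_penalty


def mutate(grouping: np.ndarray) -> np.ndarray:
    """Reassign one randomly chosen query head to a random group."""
    child = grouping.copy()
    child[rng.integers(NUM_HEADS)] = rng.integers(NUM_GROUPS)
    return child


# Initialize a population of random head-to-group assignments.
population = [rng.integers(NUM_GROUPS, size=NUM_HEADS) for _ in range(POP_SIZE)]

for _ in range(GENERATIONS):
    scored = sorted(population, key=fitness)
    parents = scored[: POP_SIZE // 2]          # truncation selection
    children = [mutate(p) for p in parents]    # one mutation per parent
    population = parents + children

best = min(population, key=fitness)
print("best grouping:", best.tolist())
print("fitness:", round(float(fitness(best)), 3))
```

In a real setting the stand-in feature vectors and the fixed cache penalty would be replaced by whatever accuracy estimate and KV-cache capacity model the paper actually uses; the sketch only shows the general shape of an evolutionary grouping search.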