Efficient Inference of Vision Instruction-Following Models with Elastic Cache


Bibliographic Details
Main Authors: Liu, Zuyan; Liu, Benlin; Wang, Jiahui; Dong, Yuhao; Chen, Guangyi; Rao, Yongming; Krishna, Ranjay; Lu, Jiwen
Format: Journal Article (arXiv preprint)
Language: English
Published: 25.07.2024

Abstract: In the field of instruction-following large vision-language models (LVLMs), the efficient deployment of these models faces challenges, notably due to the high memory demands of their key-value (KV) caches. Conventional cache management strategies for LLMs focus on cache eviction, which often fails to address the specific needs of multimodal instruction-following models. Recognizing this gap, in this paper we introduce Elastic Cache, a novel approach that benefits from applying distinct acceleration methods to the instruction encoding and output generation stages. We investigate the metrics of importance in the different stages and propose an importance-driven cache merging strategy to prune redundant caches. Instead of discarding less important caches, our strategy identifies important key/value vectors as anchor points. Surrounding less important caches are then merged with these anchors, preserving contextual information in the KV caches while yielding an arbitrary acceleration ratio. For instruction encoding, we use frequency to evaluate the importance of caches. For output generation, we prioritize tokens by their distance with an offset, so that both the initial and the most recent tokens are retained. Results on a range of LVLMs demonstrate that Elastic Cache not only boosts efficiency but also notably outperforms existing pruning methods in language generation across various tasks. Code is available at https://github.com/liuzuyan/ElasticCache
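The merging idea in the abstract can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `elastic_cache_merge` and `generation_importance` and all parameters are hypothetical, importance scores are taken as given, and a real KV cache would be handled per layer and per attention head inside a transformer.

```python
import numpy as np

def elastic_cache_merge(keys, values, importance, keep_ratio=0.5):
    """Importance-driven KV cache merging (illustrative sketch).

    keys, values: (seq_len, dim) arrays of cached key/value vectors.
    importance:   (seq_len,) per-token importance scores.
    The highest-scoring tokens become anchor points; every other
    token's key/value is averaged into its nearest anchor, so less
    important context is merged rather than discarded.
    """
    seq_len = keys.shape[0]
    n_keep = max(1, int(seq_len * keep_ratio))
    # positions of the n_keep most important tokens, in sequence order
    anchors = np.sort(np.argsort(importance)[-n_keep:])
    anchor_set = set(anchors.tolist())

    merged_k = keys[anchors].astype(float)
    merged_v = values[anchors].astype(float)
    counts = np.ones(n_keep)

    for pos in range(seq_len):
        if pos in anchor_set:
            continue
        a = int(np.argmin(np.abs(anchors - pos)))  # nearest anchor by position
        merged_k[a] += keys[pos]
        merged_v[a] += values[pos]
        counts[a] += 1

    # each anchor now holds the mean of its merged group
    return merged_k / counts[:, None], merged_v / counts[:, None]

def generation_importance(seq_len, n_initial=4, n_recent=16):
    """Distance-with-offset scoring for the generation stage (sketch):
    the first n_initial and the last n_recent tokens score highest,
    so both ends of the context survive merging."""
    scores = np.zeros(seq_len)
    scores[:n_initial] = 1.0
    scores[-n_recent:] = 1.0
    return scores
```

Because merging averages rather than evicts, every cached vector still contributes to some anchor, which is how the method preserves context at an arbitrary compression ratio.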
Rights: http://arxiv.org/licenses/nonexclusive-distrib/1.0
DOI: 10.48550/arxiv.2407.18121
Online Access: https://arxiv.org/abs/2407.18121
Subjects: Computer Science - Computer Vision and Pattern Recognition