MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning

Bibliographic Details
Main Authors Zhao, Haozhe; Cai, Zefan; Si, Shuzheng; Ma, Xiaojian; An, Kaikai; Chen, Liang; Liu, Zixuan; Wang, Sheng; Han, Wenjuan; Chang, Baobao
Format Journal Article
Language English
Published 14.09.2023
Subjects
Online Access Get full text

Abstract Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts containing multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address this limitation by 1) introducing MMICL, a vision-language model with multi-modal in-context learning, a new approach that allows the VLM to handle multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks such as MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive in-context learning ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when VLMs face extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC
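To make the abstract's notion of a complex multi-modal prompt with multiple images concrete, below is a minimal, purely illustrative Python sketch of how interleaved image-text in-context demonstrations might be assembled. The placeholder token and function names are hypothetical and are not taken from the MMICL codebase; see https://github.com/PKUnlp-icler/MIC for the actual prompt format and tooling.

# Illustrative sketch only: IMAGE_TOKEN and build_icl_prompt are hypothetical names,
# not MMICL's actual interface. The sketch shows the general shape of a multi-modal
# in-context prompt: interleaved image placeholders, questions, and answers, followed
# by a query whose answer the model is expected to generate.

IMAGE_TOKEN = "<image{idx}>"  # hypothetical per-image placeholder token

def build_icl_prompt(demos, query_image, query_question):
    """demos: list of (image_path, question, answer) triples used as in-context examples."""
    parts, images = [], []
    for i, (image_path, question, answer) in enumerate(demos):
        images.append(image_path)
        parts.append(f"Image {IMAGE_TOKEN.format(idx=i)}: {question} Answer: {answer}")
    # The query image comes last; its answer is left blank for the model to fill in.
    images.append(query_image)
    parts.append(f"Image {IMAGE_TOKEN.format(idx=len(demos))}: {query_question} Answer:")
    return "\n".join(parts), images

prompt, image_list = build_icl_prompt(
    demos=[("cat.jpg", "What animal is shown?", "A cat."),
           ("dog.jpg", "What animal is shown?", "A dog.")],
    query_image="query.jpg",
    query_question="What animal is shown?",
)
print(prompt)      # interleaved text with one placeholder per image
print(image_list)  # image paths to encode and align with the placeholders

The point of the interleaving is that each demonstration pairs an image placeholder with its question and answer, so the model can attend across several images and reuse the demonstrated task format for the final query.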
Copyright http://creativecommons.org/licenses/by-sa/4.0
DOI 10.48550/arxiv.2309.07915
DatabaseName arXiv Computer Science
arXiv.org
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed false
IsScholarly false
OpenAccessLink https://arxiv.org/abs/2309.07915
PublicationDate 2023-09-14
SecondaryResourceType preprint
SourceID arxiv
SourceType Open Access Repository
SubjectTerms Computer Science - Artificial Intelligence
Computer Science - Computation and Language
Computer Science - Computer Vision and Pattern Recognition
Title MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
URI https://arxiv.org/abs/2309.07915
linkProvider Cornell University