MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Main Authors | Zhao, Haozhe; Cai, Zefan; Si, Shuzheng; Ma, Xiaojian; An, Kaikai; Chen, Liang; Liu, Zixuan; Wang, Sheng; Han, Wenjuan; Chang, Baobao |
---|---|
Format | Journal Article |
Language | English |
Published | 14.09.2023 |
Subjects | Computer Science - Artificial Intelligence; Computer Science - Computation and Language; Computer Science - Computer Vision and Pattern Recognition |
DOI | 10.48550/arXiv.2309.07915 |
Copyright | http://creativecommons.org/licenses/by-sa/4.0 |
Online Access | Get full text: https://arxiv.org/abs/2309.07915 |
Abstract | Since the resurgence of deep learning, vision-language models (VLMs) enhanced by large language models (LLMs) have grown exponentially in popularity. However, while LLMs can utilize extensive background knowledge and task information with in-context learning, most VLMs still struggle with understanding complex multi-modal prompts that contain multiple images, making VLMs less effective in downstream vision-language tasks. In this paper, we address this limitation by 1) introducing the vision-language Model with Multi-Modal In-Context Learning (MMICL), a new approach that allows the VLM to deal with multi-modal inputs efficiently; 2) proposing a novel context scheme to augment the in-context learning ability of the VLM; and 3) constructing the Multi-modal In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to understand complex multi-modal prompts. Our experiments confirm that MMICL achieves new state-of-the-art zero-shot performance on a wide range of general vision-language tasks, especially on complex benchmarks including MME and MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge of complex multi-modal prompt understanding and exhibits impressive ICL ability. Furthermore, we observe that MMICL successfully alleviates language bias in VLMs, a common issue that often leads to hallucination when the model faces extensive textual context. Our code, dataset, dataset tool, and model are available at https://github.com/PKUnlp-icler/MIC |
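To make the abstract's central idea concrete, the sketch below shows, in plain Python, how a multi-image in-context prompt of the kind described above might be assembled: several image/text exemplars are interleaved with indexed image placeholders, followed by a query left for the model to complete. The `Example` dataclass, the `<image_N>` placeholder convention, and `build_icl_prompt` are hypothetical illustrations only, not MMICL's actual context scheme or the MIC dataset format (see the linked repository for those).

```python
# Illustrative sketch only: assembling an interleaved multi-modal in-context
# prompt (few-shot image/text exemplars followed by an open query). All names
# here are assumptions for illustration, not the MMICL/MIC format.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Example:
    image_paths: List[str]   # one exemplar may reference several images
    question: str
    answer: str              # empty string marks the final query to complete


def build_icl_prompt(examples: List[Example]) -> Tuple[str, List[str]]:
    """Interleave image placeholders with text; return (prompt, ordered image list)."""
    segments: List[str] = []
    images: List[str] = []
    for ex in examples:
        refs = []
        for path in ex.image_paths:
            refs.append(f"<image_{len(images)}>")  # unique index per image
            images.append(path)
        block = f"Images: {' '.join(refs)}\nQuestion: {ex.question}\nAnswer: {ex.answer}"
        segments.append(block.rstrip())
    return "\n\n".join(segments), images


if __name__ == "__main__":
    shots = [
        Example(["dog.jpg"], "What animal is shown?", "A dog."),
        Example(["cat.jpg", "dog.jpg"], "Do these two images show the same animal?", "No."),
        Example(["bird.jpg"], "What animal is shown?", ""),  # query to be completed
    ]
    prompt, image_list = build_icl_prompt(shots)
    print(prompt)
    print(image_list)
```

A real VLM would map each placeholder to visual features extracted from the corresponding image; the point of the sketch is only the interleaved structure that lets a model condition on several images and exemplars at once.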