MMICL: Empowering Vision-language Model with Multi-Modal In-Context Learning
Main Authors | , , , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 14.09.2023 |
Subjects | |
Summary: | Since the resurgence of deep learning, vision-language models (VLMs) enhanced
by large language models (LLMs) have grown exponentially in popularity.
However, while LLMs can utilize extensive background knowledge and task
information with in-context learning, most VLMs still struggle with
understanding complex multi-modal prompts with multiple images, making VLMs
less effective in downstream vision-language tasks. In this paper, we address
the limitation above by 1) introducing the Vision-Language Model with Multi-Modal
In-Context Learning (MMICL), a new approach that allows the VLM to deal with
multi-modal inputs efficiently; 2) proposing a novel context scheme to augment
the in-context learning ability of the VLM; 3) constructing the Multi-modal
In-Context Learning (MIC) dataset, designed to enhance the VLM's ability to
understand complex multi-modal prompts. Our experiments confirm that MMICL
achieves new state-of-the-art zero-shot performance on a wide range of general
vision-language tasks, especially for complex benchmarks, including MME and
MMBench. Our analysis demonstrates that MMICL effectively tackles the challenge
of complex multi-modal prompt understanding and exhibits impressive ICL
ability. Furthermore, we observe that MMICL successfully alleviates language
bias in VLMs, a common issue for VLMs that often leads to hallucination when
faced with extensive textual context. Our code, dataset, dataset tool, and
model are available at https://github.com/PKUnlp-icler/MIC |
---|---|
DOI: | 10.48550/arxiv.2309.07915 |