A Good Prompt Is Worth Millions of Parameters: Low-resource Prompt-based Learning for Vision-Language Models
Format | Journal Article
---|---
Language | English
Published | 16.10.2021
Summary: Large pre-trained vision-language (VL) models can learn a new task from a handful of examples and generalize to new tasks without fine-tuning. However, these VL models are hard to deploy in real-world applications because of their impractically large size and slow inference. To address this limitation, we study prompt-based low-resource learning of VL tasks with our proposed method, FewVLM, which is considerably smaller than recent few-shot learners. For FewVLM, we pre-train a sequence-to-sequence transformer model with prefix language modeling (PrefixLM) and masked language modeling (MaskedLM). Furthermore, we analyze the effect of diverse prompts on few-shot tasks. Experimental results on VQA show that FewVLM with prompt-based learning outperforms Frozen, a model 31x larger, by 18.2 percentage points, and achieves results comparable to those of PICa, a 246x larger model. In our analysis, we observe that (1) prompts significantly affect zero-shot performance but only marginally affect few-shot performance, (2) models trained with noisy prompts learn as quickly as those given hand-crafted prompts once more training data is available, and (3) MaskedLM helps VQA tasks while PrefixLM boosts captioning performance. Our code is publicly available at https://github.com/woojeongjin/FewVLM
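The summary names two pre-training objectives, PrefixLM and MaskedLM, without showing how their training pairs differ. The sketch below is a rough, simplified illustration of how such input/target pairs could be built from a caption for a text-to-text model; the function names, the token-level masking, and the T5-style sentinel tokens are assumptions for illustration, not taken from the FewVLM code.

```python
import random

def prefix_lm_example(caption_tokens, split_ratio=0.5):
    """PrefixLM (sketch): the model conditions on a prefix of the caption
    (alongside image features, omitted here) and generates the suffix."""
    split = max(1, int(len(caption_tokens) * split_ratio))
    source = caption_tokens[:split]   # encoder input: the prefix
    target = caption_tokens[split:]   # decoder target: the remaining suffix
    return source, target

def masked_lm_example(caption_tokens, mask_prob=0.15):
    """MaskedLM (sketch): tokens are randomly replaced by sentinel tokens in
    the input, and the decoder reconstructs them. This token-level variant
    with T5-style sentinels is a simplification assumed for illustration."""
    source, target, sentinel = [], [], 0
    for tok in caption_tokens:
        if random.random() < mask_prob:
            source.append(f"<extra_id_{sentinel}>")          # mask in input
            target.extend([f"<extra_id_{sentinel}>", tok])   # recover in output
            sentinel += 1
        else:
            source.append(tok)
    return source, target

if __name__ == "__main__":
    caption = "a dog runs across the wet grass".split()
    print(prefix_lm_example(caption))   # e.g. (['a', 'dog', 'runs'], ['across', 'the', 'wet', 'grass'])
    print(masked_lm_example(caption))
```

Under this reading, the paper's observation that MaskedLM helps VQA while PrefixLM boosts captioning is intuitive: reconstructing masked spans resembles producing short answers, while generating a suffix resembles open-ended caption generation.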
DOI: 10.48550/arxiv.2110.08484