VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders
Format: Journal Article
Language: English
Published: 03.09.2023
Summary: Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully labeled datasets to acquire such alignment, at great cost in time and computing resources. In this work, we explore directly applying a pre-trained generative diffusion model to the challenging discriminative task of visual grounding, without any fine-tuning or additional training data. Specifically, we propose VGDiffZero, a simple yet effective zero-shot visual grounding framework based on text-to-image diffusion models. We also design a comprehensive region-scoring method that considers both the global and local contexts of each isolated proposal. Extensive experiments on RefCOCO, RefCOCO+, and RefCOCOg show that VGDiffZero achieves strong performance on zero-shot visual grounding. Our code is available at https://github.com/xuyang-liu16/VGDiffZero.
DOI: 10.48550/arxiv.2309.01141
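For readers who want a concrete picture of how a generative diffusion model can act as a zero-shot region scorer, below is a minimal PyTorch sketch in the spirit of the abstract. It is not the authors' released implementation: the `encode_image` / `predict_noise` interfaces (standing in for a pre-trained latent diffusion model such as Stable Diffusion), the placeholder noise schedule, the crop-vs-masked treatment of "local" and "global" context, and the equal-weight fusion are all assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

import torch


def diffusion_match_score(
    image: torch.Tensor,                               # (3, H, W) image tensor
    text_emb: torch.Tensor,                            # (L, D) text-conditioning embeddings
    encode_image: Callable[[torch.Tensor], torch.Tensor],
    predict_noise: Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor],
    num_timesteps: int = 10,
) -> torch.Tensor:
    """Average noise-prediction error when the image latent is noised and the model
    denoises it conditioned on the text; lower error = better image-text match."""
    latent = encode_image(image.unsqueeze(0))          # (1, C, h, w); resizing assumed inside
    errors = []
    for t in torch.linspace(100, 900, num_timesteps).long():
        noise = torch.randn_like(latent)
        # Placeholder cosine schedule; a real model's own schedule would be used instead.
        alpha_bar = torch.cos(t / 1000.0 * math.pi / 2) ** 2
        noisy = alpha_bar.sqrt() * latent + (1.0 - alpha_bar).sqrt() * noise
        pred = predict_noise(noisy, t.view(1), text_emb.unsqueeze(0))
        errors.append(torch.mean((pred - noise) ** 2))
    return torch.stack(errors).mean()


def ground_expression(
    image: torch.Tensor,
    proposals: List[Tuple[int, int, int, int]],        # (x1, y1, x2, y2) region proposals
    expression_emb: torch.Tensor,                      # embedding of the referring expression
    encode_image: Callable[[torch.Tensor], torch.Tensor],
    predict_noise: Callable[[torch.Tensor, torch.Tensor, torch.Tensor], torch.Tensor],
) -> int:
    """Return the index of the proposal that best matches the expression by fusing a
    cropped (local-context) score with a masked full-image (global-context) score."""
    scores = []
    for x1, y1, x2, y2 in proposals:
        crop = image[:, y1:y2, x1:x2]                  # local context: the isolated proposal
        masked = torch.zeros_like(image)
        masked[:, y1:y2, x1:x2] = crop                 # global context: proposal kept, rest blanked
        s_local = diffusion_match_score(crop, expression_emb, encode_image, predict_noise)
        s_global = diffusion_match_score(masked, expression_emb, encode_image, predict_noise)
        scores.append(0.5 * (s_local + s_global))      # assumed equal-weight fusion
    return int(torch.argmin(torch.stack(scores)))      # lowest denoising error wins
```

The core idea the sketch illustrates is that a text-conditioned diffusion model's denoising error can serve as a vision-language matching score without any fine-tuning: proposals whose content agrees with the referring expression are easier for the model to denoise under that conditioning, so the proposal with the lowest error is selected as the grounding result.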