VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders
Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignmen...
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 03.09.2023 |