VGDiffZero: Text-to-image Diffusion Models Can Be Zero-shot Visual Grounders

Large-scale text-to-image diffusion models have shown impressive capabilities for generative tasks by leveraging strong vision-language alignment from pre-training. However, most vision-language discriminative tasks require extensive fine-tuning on carefully-labeled datasets to acquire such alignmen...

Bibliographic Details
Main Authors Liu, Xuyang; Huang, Siteng; Kang, Yachen; Chen, Honggang; Wang, Donglin
Format Journal Article
Language English
Published 03.09.2023
