Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?

Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of...

Bibliographic Details
Published in: arXiv.org
Main Authors: Bowen Zhao, Leo Parker Dirac, Paulina Varshavskaya
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 25.09.2024
