Can Vision Language Models Learn from Visual Demonstrations of Ambiguous Spatial Reasoning?
Published in | arXiv.org |
---|---|
Main Authors | Bowen Zhao, Leo Parker Dirac, Paulina Varshavskaya |
Format | Paper |
Language | English |
Published | Ithaca: Cornell University Library, arXiv.org, 25.09.2024 |
Subjects | Computer vision; Context; Learning; Visual tasks |
Online Access | Get full text: https://www.proquest.com/docview/3110112769 |
Abstract | Large vision-language models (VLMs) have become state-of-the-art for many computer vision tasks, with in-context learning (ICL) as a popular adaptation strategy for new ones. But can VLMs learn novel concepts purely from visual demonstrations, or are they limited to adapting to the output format of ICL examples? We propose a new benchmark we call Spatial Visual Ambiguity Tasks (SVAT) that challenges state-of-the-art VLMs to learn new visuospatial tasks in-context. We find that VLMs fail to do this zero-shot, and sometimes continue to fail after finetuning. However, adding simpler data to the training by curriculum learning leads to improved ICL performance. |
---|---|
Copyright | 2024. This work is published under http://creativecommons.org/licenses/by-sa/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
EISSN | 2331-8422 |
Genre | Working Paper/Pre-Print |