CV-Probes: Studying the interplay of lexical and world knowledge in visually grounded verb understanding
Format | Journal Article
Language | English
Published | 02.09.2024
Summary: This study investigates the ability of various vision-language (VL) models to ground context-dependent and non-context-dependent verb phrases. To this end, we introduce the CV-Probes dataset, designed explicitly for studying context understanding; it contains image-caption pairs with context-dependent verbs (e.g., "beg") and non-context-dependent verbs (e.g., "sit"). We employ the MM-SHAP evaluation to assess the contribution of verb tokens to model predictions. Our results indicate that VL models struggle to ground context-dependent verb phrases effectively. These findings highlight the challenges of training VL models to integrate context accurately, suggesting a need for improved methodologies in VL model training and evaluation.
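MM-SHAP rests on Shapley values: a token's contribution is its average marginal effect on the model's score across all orders in which tokens can be added. As a minimal illustration only (not the paper's implementation), the sketch below computes exact Shapley values for a three-token caption, with a hypothetical `toy_score` function standing in for a real VL model's image-text alignment score:

```python
from itertools import permutations

def shapley_values(tokens, score):
    """Exact Shapley values: average marginal contribution of each
    token over all orderings (tractable only for short captions)."""
    n = len(tokens)
    values = {t: 0.0 for t in tokens}
    perms = list(permutations(range(n)))
    for order in perms:
        included = set()
        prev = score(included)
        for i in order:
            included.add(i)
            cur = score(included)
            values[tokens[i]] += cur - prev
            prev = cur
    return {t: v / len(perms) for t, v in values.items()}

# Hypothetical stand-in for a VL model's alignment score: the caption
# only matches the image when the verb token (index 1) is present.
def toy_score(included):
    return 1.0 if 1 in included else 0.2

contribs = shapley_values(["man", "begs", "street"], toy_score)
# All of the score gap (0.8) is attributed to the verb "begs".
```

In the real evaluation, the score function would be a VL model's prediction over masked token subsets; the toy version just makes the attribution mechanics concrete.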
DOI: 10.48550/arxiv.2409.01389