CommVQA: Situating Visual Question Answering in Communicative Contexts
Main Authors | , , |
---|---|
Format | Journal Article |
Language | English |
Published | 22.02.2024 |
Subjects | |
Online Access | Get full text |
Summary | Current visual question answering (VQA) models tend to be trained and evaluated on image-question pairs in isolation. However, the questions people ask depend on their informational needs and prior knowledge about the image content. To evaluate how situating images within naturalistic contexts shapes visual questions, we introduce CommVQA, a VQA dataset consisting of images, image descriptions, real-world communicative scenarios where the image might appear (e.g., a travel website), and follow-up questions and answers conditioned on the scenario and description. CommVQA, which contains 1,000 images and 8,949 question-answer pairs, poses a challenge for current models. Error analyses and a human-subjects study suggest that generated answers still contain high rates of hallucinations, fail to fittingly address unanswerable questions, and do not suitably reflect contextual information. Overall, we show that access to contextual information is essential for solving CommVQA, leading to the highest-performing VQA model and highlighting the relevance of situating systems within communicative scenarios. |
---|---|
DOI | 10.48550/arxiv.2402.15002 |
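The abstract describes each CommVQA item as an image paired with a description, a communicative scenario, and scenario-conditioned question-answer pairs. As a rough illustration of that structure, here is a minimal sketch of one such record in Python; the class and field names (`CommVQAExample`, `image_path`, `scenario`, etc.) and the example values are assumptions for illustration, not the dataset's actual schema.

```python
# Hypothetical sketch of a single CommVQA example, based only on the
# components named in the abstract (image, description, scenario, QA pairs).
# Field names and values are illustrative assumptions, not the real schema.
from dataclasses import dataclass, field


@dataclass
class QAPair:
    question: str  # follow-up question conditioned on scenario + description
    answer: str    # reference answer (may indicate the question is unanswerable)


@dataclass
class CommVQAExample:
    image_path: str                          # path or URL to the image
    description: str                         # image description shown to askers
    scenario: str                            # communicative context for the image
    qa_pairs: list[QAPair] = field(default_factory=list)


# Usage with made-up content:
example = CommVQAExample(
    image_path="images/0001.jpg",
    description="A cobblestone plaza lined with outdoor cafe tables.",
    scenario="travel website",
    qa_pairs=[
        QAPair(
            question="Are the cafes open in the evening?",
            answer="The image does not show opening hours.",
        )
    ],
)
print(example.scenario, len(example.qa_pairs))
```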