Caption-Driven Explorations: Aligning Image and Text Embeddings through Human-Inspired Foveated Vision
Field | Value
---|---
Main Authors | |
Format | Journal Article
Language | English
Published | 19.08.2024
Subjects | |
Online Access | Get full text
Summary: Understanding human attention is crucial for vision science and AI. While many models exist for free-viewing, less is known about task-driven image exploration. To address this, we introduce CapMIT1003, a dataset with captions and click-contingent image explorations, to study human attention during the captioning task. We also present NevaClip, a zero-shot method for predicting visual scanpaths by combining CLIP models with NeVA algorithms. NevaClip generates fixations that align the representations of foveated visual stimuli with those of the corresponding captions. The simulated scanpaths outperform existing human attention models in plausibility for captioning and free-viewing tasks. This research enhances the understanding of human attention and advances scanpath prediction models.
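The summary's core mechanism (choosing fixations so that a foveated view of the image embeds close to the caption embedding) can be illustrated with a toy numpy sketch. This is not the authors' implementation: the real method uses CLIP encoders and the NeVA optimization loop, whereas here the encoders are stand-in random projections, foveation is approximated by a Gaussian acuity mask, and the fixation is chosen by brute-force search over a candidate grid.

```python
import numpy as np

rng = np.random.default_rng(0)

def foveate(image, fix, sigma=5.0):
    # Crude foveation: full detail near the fixation, content washed out
    # toward the image mean in the periphery (assumption, not the paper's model).
    h, w = image.shape
    ys, xs = np.mgrid[0:h, 0:w]
    dist2 = (ys - fix[0]) ** 2 + (xs - fix[1]) ** 2
    acuity = np.exp(-dist2 / (2 * sigma ** 2))  # 1 at fixation, -> 0 far away
    return acuity * image + (1 - acuity) * image.mean()

def embed(x, proj):
    # Stand-in encoder: flatten, linearly project, L2-normalise (CLIP-style
    # unit-norm embeddings, so a dot product is cosine similarity).
    v = proj @ x.ravel()
    return v / np.linalg.norm(v)

# Toy image and a pretend caption embedding (here: the embedding of the
# un-foveated image, standing in for a real CLIP text embedding).
image = rng.random((32, 32))
proj = rng.standard_normal((16, 32 * 32))
caption_emb = embed(image, proj)

# Zero-shot fixation choice: over a grid of candidate fixations, pick the
# one whose foveated view aligns best with the caption embedding.
candidates = [(y, x) for y in range(4, 32, 8) for x in range(4, 32, 8)]
scores = [caption_emb @ embed(foveate(image, f), proj) for f in candidates]
best_fix = candidates[int(np.argmax(scores))]
```

Iterating this choice, re-foveating around each new fixation, would yield a scanpath; the actual NevaClip method instead optimizes fixations with gradients through the CLIP similarity, which this grid search only approximates.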
DOI: 10.48550/arxiv.2408.09948