Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 163, pp. 90–100
Main Authors: Das, Abhishek; Agrawal, Harsh; Zitnick, Larry; Parikh, Devi; Batra, Dhruv
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.10.2017

Summary:
•Multiple game-inspired novel interfaces for collecting human attention maps of where humans choose to look to answer questions from the large-scale VQA dataset (Antol et al., 2015).
•Qualitative and quantitative comparison of the attention maps generated by state-of-the-art attention-based VQA models (Yang et al., 2016; Lu et al., 2016) and a task-independent saliency baseline (Judd et al., 2009) against our human attention maps, through visualizations and rank-order correlation.
•A VQA model trained with explicit supervision for attention, using our human attention maps as ground truth.

We conduct large-scale studies on ‘human attention’ in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans. Finally, we train VQA models with explicit attention supervision, and find that it improves VQA performance.
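
The quantitative comparison described above rests on rank-order correlation between model-generated and human attention maps. The following is a minimal sketch, not the authors' released code, of how such a comparison could be computed with SciPy's Spearman correlation; it assumes both maps have already been resized to the same spatial grid, and the array shapes and names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr


def rank_correlation(model_att: np.ndarray, human_att: np.ndarray) -> float:
    """Spearman rank-order correlation between two attention maps.

    Both maps are assumed to share the same spatial resolution; each is
    flattened so that every spatial location contributes one ranked value.
    """
    assert model_att.shape == human_att.shape, "maps must share a spatial grid"
    rho, _ = spearmanr(model_att.ravel(), human_att.ravel())
    return rho


if __name__ == "__main__":
    # Toy example with random 14x14 attention grids (a common VQA feature-map size).
    rng = np.random.default_rng(0)
    model_map = rng.random((14, 14))
    human_map = rng.random((14, 14))
    print(f"rank correlation: {rank_correlation(model_map, human_map):.3f}")
```

A correlation near 1 would indicate that the model ranks image regions much as humans do; values near 0 indicate little agreement, which is the pattern the abstract reports for current attention-based VQA models.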
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2017.10.001