Human Attention in Visual Question Answering: Do Humans and Deep Networks Look at the Same Regions?

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 163, pp. 90–100
Main Authors: Das, Abhishek; Agrawal, Harsh; Zitnick, Larry; Parikh, Devi; Batra, Dhruv
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.10.2017

Summary:
•Multiple game-inspired novel interfaces for collecting human attention maps of where humans choose to look to answer questions from the large-scale VQA dataset (Antol et al., 2015).
•Qualitative and quantitative comparison of the attention maps generated by state-of-the-art attention-based VQA models (Yang et al., 2016; Lu et al., 2016) and a task-independent saliency baseline (Judd et al., 2009) against our human attention maps, through visualizations and rank-order correlation.
•A VQA model trained with explicit supervision for attention, using our human attention maps as ground truth.

We conduct large-scale studies on ‘human attention’ in Visual Question Answering (VQA) to understand where humans choose to look to answer questions about images. We design and test multiple game-inspired novel attention-annotation interfaces that require the subject to sharpen regions of a blurred image to answer a question. Thus, we introduce the VQA-HAT (Human ATtention) dataset. We evaluate attention maps generated by state-of-the-art VQA models against human attention both qualitatively (via visualizations) and quantitatively (via rank-order correlation). Our experiments show that current attention models in VQA do not seem to be looking at the same regions as humans. Finally, we train VQA models with explicit attention supervision, and find that it improves VQA performance.
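
The quantitative comparison described above rests on rank-order correlation between model-generated and human attention maps. The following is a minimal sketch, not the authors' released code, of how such a comparison could be computed with SciPy's Spearman correlation; it assumes both maps have already been resized to the same spatial grid, and the array shapes and names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr


def rank_correlation(model_att: np.ndarray, human_att: np.ndarray) -> float:
    """Spearman rank-order correlation between two attention maps.

    Both maps are assumed to share the same spatial resolution; each is
    flattened so that every spatial location contributes one ranked value.
    """
    assert model_att.shape == human_att.shape, "maps must share a spatial grid"
    rho, _ = spearmanr(model_att.ravel(), human_att.ravel())
    return rho


if __name__ == "__main__":
    # Toy example with random 14x14 attention grids (a common VQA feature-map size).
    rng = np.random.default_rng(0)
    model_map = rng.random((14, 14))
    human_map = rng.random((14, 14))
    print(f"rank correlation: {rank_correlation(model_map, human_map):.3f}")
```

A correlation near 1 would indicate that the model ranks image regions much as humans do; values near 0 indicate little agreement, which is the pattern the abstract reports for current attention-based VQA models.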
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2017.10.001