Multimodal feature-wise co-attention method for visual question answering

Bibliographic Details
Published in: Information Fusion, Vol. 73, pp. 1-10
Main Authors: Zhang, Sheng; Chen, Min; Chen, Jincai; Zou, Fuhao; Li, Yuan-Fang; Lu, Ping
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.09.2021

More Information
Summary: VQA has attracted considerable research interest in recent years and could potentially be applied to remote consultation for COVID-19. Attention mechanisms provide an effective way of selectively utilizing visual and question information in visual question answering (VQA). The attention methods of existing VQA models generally focus on the spatial dimension; that is, attention is modeled as spatial probabilities that re-weight image region or word token features. However, feature-wise attention cannot be ignored, as image and question representations are organized along both the spatial and feature dimensions. Taking the question "What is the color of the woman's hair" as an example, identifying the hair-color attribute feature is as important as focusing on the hair region. In this paper, we propose a novel neural network module named the "multimodal feature-wise attention module" (MulFA) to model feature-wise attention. Extensive experiments show that MulFA is capable of filtering representations for feature refinement and leads to improved performance. By introducing MulFA modules, we construct an effective union feature-wise and spatial co-attention network (UFSCAN) model for VQA. Our evaluation on two large-scale VQA datasets, VQA 1.0 and VQA 2.0, shows that UFSCAN achieves performance competitive with state-of-the-art models.
•Visual question answering could potentially be applied to remote consultation for COVID-19.
•Feature-wise attention enables obtaining more discriminative representations.
•Feature-wise attention is an effective complement to spatial attention.
•Feature-wise attention can work in both the image and question modalities.
•A bilinear model performs better than element-wise multiplication in multimodal fusion.
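To make the feature-wise attention idea concrete, here is a minimal, hypothetical PyTorch sketch; it is not the paper's MulFA implementation, and all names (FeatureWiseGate, rank, k) are illustrative. It derives a per-channel gate from the question and re-weights the image region features along the feature dimension, using a low-rank bilinear fusion instead of plain element-wise multiplication, in the spirit of the last highlight above.

```python
import torch
import torch.nn as nn

class FeatureWiseGate(nn.Module):
    """Hypothetical sketch: question-conditioned feature-wise (channel)
    attention over image region features. Not the paper's MulFA code."""
    def __init__(self, img_dim: int, q_dim: int, rank: int = 256, k: int = 4):
        super().__init__()
        # Low-rank bilinear fusion: project both modalities to rank * k,
        # multiply element-wise, then sum-pool over k.
        self.v_proj = nn.Linear(img_dim, rank * k)
        self.q_proj = nn.Linear(q_dim, rank * k)
        self.k = k
        self.gate = nn.Linear(rank, img_dim)

    def forward(self, v: torch.Tensor, q: torch.Tensor) -> torch.Tensor:
        # v: (B, R, img_dim) region features; q: (B, q_dim) question vector
        pooled = v.mean(dim=1)                        # summarize regions: (B, img_dim)
        fused = self.v_proj(pooled) * self.q_proj(q)  # (B, rank * k)
        fused = fused.view(fused.size(0), -1, self.k).sum(dim=2)  # (B, rank)
        g = torch.sigmoid(self.gate(fused))           # per-channel gate in (0, 1)
        return v * g.unsqueeze(1)                     # re-weight the feature dimension

# Example: 36 region features of dim 2048, question embedding of dim 1024.
v = torch.randn(2, 36, 2048)
q = torch.randn(2, 1024)
print(FeatureWiseGate(2048, 1024)(v, q).shape)  # torch.Size([2, 36, 2048])
```

A symmetric gate could be computed in the other direction, conditioning on the image to re-weight question token features; the abstract notes that feature-wise attention works in both modalities.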
ISSN: 1566-2535, 1872-6305
DOI: 10.1016/j.inffus.2021.02.022