A Visual Question Answering Network Merging High- and Low-Level Semantic Information
| Published in | IEICE Transactions on Information and Systems, Vol. E106.D, No. 5, pp. 581-589 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Tokyo: The Institute of Electronics, Information and Communication Engineers; Japan Science and Technology Agency, 01.05.2023 |
Summary: Visual Question Answering (VQA) typically uses deep attention mechanisms to learn fine-grained visual content of images and textual content of questions. However, deep attention mechanisms learn only high-level semantic information and ignore the impact of low-level semantic information on answer prediction. To address this, we design a High- and Low-Level Semantic Information Network (HLSIN), which employs two strategies to fuse high-level and low-level semantic information. The first is adaptive weight learning, which allows each level of semantic information to learn its own weight. The second is a gate-sum mechanism, which suppresses invalid information at each level and fuses the valid information. On the benchmark VQA-v2 dataset, we evaluate HLSIN quantitatively and qualitatively, and we conduct extensive ablation studies to explore the reasons behind its effectiveness. Experimental results demonstrate that HLSIN significantly outperforms the previous state of the art, with an overall accuracy of 70.93% on test-dev.
ISSN: 0916-8532; 1745-1361
DOI: 10.1587/transinf.2022DLP0002
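The record names HLSIN's two fusion strategies but gives no equations, so the sketch below is only a generic rendering of the two ideas in PyTorch under stated assumptions: the module names (`AdaptiveWeightFusion`, `GateSumFusion`), the softmax-normalized scalar weights, and the per-dimension sigmoid gates are all illustrative choices, not the paper's actual formulation.

```python
# Illustrative sketch only: class and variable names are assumptions,
# not HLSIN's published architecture.
import torch
import torch.nn as nn


class AdaptiveWeightFusion(nn.Module):
    """First strategy (as described): each semantic level learns its own
    scalar weight; a softmax keeps the weights normalized."""

    def __init__(self, num_levels: int = 2):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * f for w, f in zip(weights, features))


class GateSumFusion(nn.Module):
    """Second strategy (as described): sigmoid gates suppress invalid
    dimensions in each level before the gated features are summed."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_high = nn.Linear(dim, dim)  # gate for high-level branch
        self.gate_low = nn.Linear(dim, dim)   # gate for low-level branch

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        g_h = torch.sigmoid(self.gate_high(high))  # values in [0, 1]; 0 suppresses
        g_l = torch.sigmoid(self.gate_low(low))
        return g_h * high + g_l * low


if __name__ == "__main__":
    high = torch.randn(8, 512)  # e.g., output of a deep attention stack
    low = torch.randn(8, 512)   # e.g., an earlier, low-level representation
    print(AdaptiveWeightFusion()([high, low]).shape)  # torch.Size([8, 512])
    print(GateSumFusion(512)(high, low).shape)        # torch.Size([8, 512])
```

In this kind of design, the softmax weights trade off whole levels against each other, while the gates act elementwise within each level, which matches the abstract's distinction between weighting levels separately and suppressing invalid information inside them.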