A Visual Question Answering Network Merging High- and Low-Level Semantic Information
| Published in | IEICE Transactions on Information and Systems, Vol. E106.D, No. 5, pp. 581-589 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | Tokyo: The Institute of Electronics, Information and Communication Engineers; Japan Science and Technology Agency, 01.05.2023 |
Summary: Visual Question Answering (VQA) typically uses deep attention mechanisms to learn fine-grained visual content of images and textual content of questions. However, deep attention mechanisms learn only high-level semantic information and ignore the impact of low-level semantic information on answer prediction. To address this, we design a High- and Low-Level Semantic Information Network (HLSIN), which employs two strategies to fuse high-level and low-level semantic information. The first is adaptive weight learning, which allows each level of semantic information to learn its own weight. The second is a gate-sum mechanism, which suppresses invalid information at each level and fuses the valid information. On the benchmark VQA-v2 dataset, we evaluate HLSIN quantitatively and qualitatively, and we conduct extensive ablation studies to explore the reasons behind its effectiveness. Experimental results demonstrate that HLSIN significantly outperforms the previous state of the art, with an overall accuracy of 70.93% on test-dev.
ISSN: 0916-8532; 1745-1361
DOI: 10.1587/transinf.2022DLP0002
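The record names HLSIN's two fusion strategies but gives no equations, so the sketch below is only a generic rendering of the two ideas in PyTorch under stated assumptions: the module names (`AdaptiveWeightFusion`, `GateSumFusion`), the softmax-normalized scalar weights, and the per-dimension sigmoid gates are all illustrative choices, not the paper's actual formulation.

```python
# Illustrative sketch only: class and variable names are assumptions,
# not HLSIN's published architecture.
import torch
import torch.nn as nn


class AdaptiveWeightFusion(nn.Module):
    """First strategy (as described): each semantic level learns its own
    scalar weight; a softmax keeps the weights normalized."""

    def __init__(self, num_levels: int = 2):
        super().__init__()
        self.logits = nn.Parameter(torch.zeros(num_levels))

    def forward(self, features: list[torch.Tensor]) -> torch.Tensor:
        weights = torch.softmax(self.logits, dim=0)
        return sum(w * f for w, f in zip(weights, features))


class GateSumFusion(nn.Module):
    """Second strategy (as described): sigmoid gates suppress invalid
    dimensions in each level before the gated features are summed."""

    def __init__(self, dim: int):
        super().__init__()
        self.gate_high = nn.Linear(dim, dim)  # gate for high-level branch
        self.gate_low = nn.Linear(dim, dim)   # gate for low-level branch

    def forward(self, high: torch.Tensor, low: torch.Tensor) -> torch.Tensor:
        g_h = torch.sigmoid(self.gate_high(high))  # values in [0, 1]; 0 suppresses
        g_l = torch.sigmoid(self.gate_low(low))
        return g_h * high + g_l * low


if __name__ == "__main__":
    high = torch.randn(8, 512)  # e.g., output of a deep attention stack
    low = torch.randn(8, 512)   # e.g., an earlier, low-level representation
    print(AdaptiveWeightFusion()([high, low]).shape)  # torch.Size([8, 512])
    print(GateSumFusion(512)(high, low).shape)        # torch.Size([8, 512])
```

In this kind of design, the softmax weights trade off whole levels against each other, while the gates act elementwise within each level, which matches the abstract's distinction between weighting levels separately and suppressing invalid information inside them.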