Multi-Level Knowledge Injecting for Visual Commonsense Reasoning
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 31, No. 3, pp. 1042-1054
Format: Journal Article
Language: English
Published: New York: IEEE, 01.03.2021
Publisher: The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: When glancing at an image, humans can infer what is hidden beyond what is visually obvious, such as objects' functions and people's intents and mental states. Such visual reasoning, however, is tremendously difficult for computers, since it requires knowledge about how the world works. To address this issue, we propose the Commonsense Knowledge based Reasoning Model (CKRM), which acquires external knowledge to support the Visual Commonsense Reasoning (VCR) task, where a computer is expected to answer challenging visual questions. Our key ideas are: (1) To bridge the gap between recognition-level and cognition-level image understanding, we inject external commonsense knowledge via a multi-level knowledge transfer network that performs joint information transfer at the cell, layer and attention levels. It can effectively capture knowledge from different perspectives and equip the model with human common sense in advance. (2) To further promote image understanding at the cognitive level, we propose a knowledge-based reasoning approach that relates the transferred knowledge to the visual content and composes the reasoning cues to derive the final answer. Experiments conducted on the challenging visual commonsense reasoning dataset VCR verify the effectiveness of the proposed CKRM approach, which significantly improves reasoning performance and achieves state-of-the-art accuracy.
ISSN: 1051-8215; 1558-2205
DOI: 10.1109/TCSVT.2020.2991866
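
This record carries no implementation details beyond the abstract, so the following is a minimal, hypothetical sketch of what joint information transfer at the cell, layer and attention levels could look like in PyTorch. All module names, dimensions, gating and fusion choices below are illustrative assumptions, not the authors' actual CKRM architecture.

```python
# Hypothetical sketch of multi-level knowledge transfer, based only on the
# abstract's description (cell-, layer- and attention-level transfer).
import torch
import torch.nn as nn


class MultiLevelKnowledgeTransfer(nn.Module):
    """Injects states of a frozen 'knowledge' encoder into a task encoder
    at three points: cell state, layer output, and via attention."""

    def __init__(self, dim: int = 512, heads: int = 8):
        super().__init__()
        # Stand-in for an encoder pretrained on external commonsense
        # knowledge (assumption); frozen during task training.
        self.knowledge_rnn = nn.LSTM(dim, dim, batch_first=True)
        for p in self.knowledge_rnn.parameters():
            p.requires_grad = False
        # Task-side recurrent cell that receives the transferred knowledge.
        self.task_cell = nn.LSTMCell(dim, dim)
        # Cell-level transfer: gate mixing the knowledge cell state into
        # the task cell state at every time step.
        self.cell_gate = nn.Linear(2 * dim, dim)
        # Layer-level transfer: fuse the two encoders' output sequences.
        self.layer_fuse = nn.Linear(2 * dim, dim)
        # Attention-level transfer: task states attend over all knowledge
        # states.
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, dim), e.g. fused question/image features.
        k_out, (_, k_c) = self.knowledge_rnn(x)
        batch, seq_len, dim = x.shape
        h = x.new_zeros(batch, dim)
        c = x.new_zeros(batch, dim)
        steps = []
        for t in range(seq_len):
            h, c = self.task_cell(x[:, t], (h, c))
            # Cell-level: gated injection of the final knowledge cell
            # state (a simplification; per-step states are another option).
            g = torch.sigmoid(self.cell_gate(torch.cat([c, k_c[-1]], -1)))
            c = g * c + (1.0 - g) * k_c[-1]
            steps.append(h)
        task_out = torch.stack(steps, dim=1)          # (batch, seq, dim)
        # Layer-level: fuse task-layer and knowledge-layer outputs.
        fused = torch.tanh(self.layer_fuse(torch.cat([task_out, k_out], -1)))
        # Attention-level: attend from fused states over knowledge states.
        attended, _ = self.attn(fused, k_out, k_out)
        return fused + attended
```

In the full model, the transferred representation would then feed the knowledge-based reasoning stage that relates it to visual content and composes reasoning cues into answer scores; that stage is omitted from this sketch.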