Vision-Language-Knowledge Co-Embedding for Visual Commonsense Reasoning

Visual commonsense reasoning is an intelligent task performed to decide the most appropriate answer to a question while providing the rationale or reason for the answer when an image, a natural language question, and candidate responses are given. For effective visual commonsense reasoning, both the...

Full description

Saved in:

Bibliographic Details
Published in	Sensors (Basel, Switzerland) Vol. 21; no. 9; p. 2911
Main Authors	Lee, JaeYun, Kim, Incheol
Format	Journal Article
Language	English
Published	Switzerland MDPI AG 21.04.2021 MDPI
Subjects	Curricula Datasets graph convolutional network Knowledge acquisition Knowledge bases (artificial intelligence) knowledge graph Knowledge representation Language Linguistics multimodal co-embedding Natural language Natural Language Processing Neural networks Neural Networks, Computer pretrained multi-head self-attention network Problem Solving Reasoning Semantics visual commonsense reasoning graph convolutional network pretrained multi-head self-attention network multimodal co-embedding knowledge graph visual commonsense reasoning
Online Access	Get full text

Cover

Loading…

More Information
Summary:	Visual commonsense reasoning is an intelligent task performed to decide the most appropriate answer to a question while providing the rationale or reason for the answer when an image, a natural language question, and candidate responses are given. For effective visual commonsense reasoning, both the knowledge acquisition problem and the multimodal alignment problem need to be solved. Therefore, we propose a novel Vision-Language-Knowledge Co-embedding (ViLaKC) model that extracts knowledge graphs relevant to the question from an external knowledge base, ConceptNet, and uses them together with the input image to answer the question. The proposed model uses a pretrained vision-language-knowledge embedding module, which co-embeds multimodal data including images, natural language texts, and knowledge graphs into a single feature vector. To reflect the structural information of the knowledge graph, the proposed model uses the graph convolutional neural network layer to embed the knowledge graph first and then uses multi-head self-attention layers to co-embed it with the image and natural language question. The effectiveness and performance of the proposed model are experimentally validated using the VCR v1.0 benchmark dataset.
Bibliography:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23
ISSN:	1424-8220 1424-8220
DOI:	10.3390/s21092911