Scene Graph Refinement Network for Visual Question Answering

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 25, pp. 3950-3961
Main Authors: Qian, Tianwen; Chen, Jingjing; Chen, Shaoxiang; Wu, Bo; Jiang, Yu-Gang
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023

Summary: Visual Question Answering (VQA) aims to answer free-form natural language questions based on the visual clues in a given image. It is a difficult problem, as it requires understanding the fine-grained structured information of both language and image for compositional reasoning. To establish such compositional reasoning, recent works attempt to introduce the scene graph into VQA. However, the generated scene graphs are usually quite noisy, which greatly limits question-answering performance. Therefore, this paper proposes to refine the scene graphs to improve their effectiveness. Specifically, we present a novel Scene Graph Refinement network (SGR), which introduces a transformer-based refinement network to enhance the object and relation features for better classification. Moreover, as the question provides valuable clues for distinguishing whether the $\left\langle \mathit{subject, predicate, object} \right\rangle$ triplets are helpful or not, the SGR network exploits the semantic information presented in the questions to select the most relevant relations for question answering. Extensive experiments conducted on the GQA benchmark demonstrate the effectiveness of our method.
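
The summary describes two components: a transformer-based module that refines noisy scene-graph object and relation features before re-classifying them, and a question-guided step that scores ⟨subject, predicate, object⟩ triplets to keep only the relations relevant to the question. The following is a minimal PyTorch sketch of those two ideas under illustrative assumptions; the module names, feature dimensions, class counts, and top-k selection are hypothetical and not taken from the authors' implementation.

```python
# Hypothetical sketch of the two ideas in the abstract, not the authors' code:
# (1) a transformer encoder that refines object/relation features jointly,
# (2) question-guided scoring that selects the most relevant relation triplets.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneGraphRefiner(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, num_layers=2,
                 num_obj_classes=1700, num_rel_classes=310):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Re-classify the refined object and relation features.
        self.obj_head = nn.Linear(feat_dim, num_obj_classes)
        self.rel_head = nn.Linear(feat_dim, num_rel_classes)

    def forward(self, obj_feats, rel_feats):
        # obj_feats: (B, N_obj, D), rel_feats: (B, N_rel, D)
        tokens = torch.cat([obj_feats, rel_feats], dim=1)
        refined = self.encoder(tokens)          # joint self-attention
        n_obj = obj_feats.size(1)
        obj_refined, rel_refined = refined[:, :n_obj], refined[:, n_obj:]
        return (obj_refined, rel_refined,
                self.obj_head(obj_refined), self.rel_head(rel_refined))


def select_relevant_relations(question_emb, rel_refined, top_k=10):
    """Score each relation against a pooled question embedding and keep the
    top-k triplets. question_emb: (B, D); rel_refined: (B, N_rel, D)."""
    scores = torch.einsum('bd,bnd->bn', question_emb, rel_refined)
    weights = F.softmax(scores, dim=-1)
    top_idx = weights.topk(top_k, dim=-1).indices
    return top_idx, weights
```

In this sketch, the selected triplets (and their weights) would then be fed to a downstream answer-prediction module together with the question representation; that part is omitted, as the abstract does not specify it.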
ISSN: 1520-9210
EISSN: 1941-0077
DOI: 10.1109/TMM.2022.3169065