Scene Graph Refinement Network for Visual Question Answering

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 25, pp. 3950-3961
Main Authors: Qian, Tianwen; Chen, Jingjing; Chen, Shaoxiang; Wu, Bo; Jiang, Yu-Gang
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023

Summary: Visual Question Answering (VQA) aims to answer free-form natural language questions based on the visual clues in a given image. It is a difficult problem, as it requires understanding the fine-grained structured information of both language and image for compositional reasoning. To establish such compositional reasoning, recent works attempt to introduce the scene graph into VQA. However, the generated scene graphs are usually quite noisy, which greatly limits question-answering performance. Therefore, this paper proposes to refine the scene graphs to improve their effectiveness. Specifically, we present a novel Scene Graph Refinement network (SGR), which introduces a transformer-based refinement network to enhance the object and relation features for better classification. Moreover, as the question provides valuable clues for distinguishing whether the $\left\langle \mathit{subject, predicate, object} \right\rangle$ triplets are helpful or not, the SGR network exploits the semantic information presented in the questions to select the most relevant relations for question answering. Extensive experiments conducted on the GQA benchmark demonstrate the effectiveness of our method.
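
The summary describes two components: a transformer-based module that refines noisy scene-graph object and relation features before re-classifying them, and a question-guided step that scores ⟨subject, predicate, object⟩ triplets to keep only the relations relevant to the question. The following is a minimal PyTorch sketch of those two ideas under illustrative assumptions; the module names, feature dimensions, class counts, and top-k selection are hypothetical and not taken from the authors' implementation.

```python
# Hypothetical sketch of the two ideas in the abstract, not the authors' code:
# (1) a transformer encoder that refines object/relation features jointly,
# (2) question-guided scoring that selects the most relevant relation triplets.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SceneGraphRefiner(nn.Module):
    def __init__(self, feat_dim=512, num_heads=8, num_layers=2,
                 num_obj_classes=1700, num_rel_classes=310):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=num_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Re-classify the refined object and relation features.
        self.obj_head = nn.Linear(feat_dim, num_obj_classes)
        self.rel_head = nn.Linear(feat_dim, num_rel_classes)

    def forward(self, obj_feats, rel_feats):
        # obj_feats: (B, N_obj, D), rel_feats: (B, N_rel, D)
        tokens = torch.cat([obj_feats, rel_feats], dim=1)
        refined = self.encoder(tokens)          # joint self-attention
        n_obj = obj_feats.size(1)
        obj_refined, rel_refined = refined[:, :n_obj], refined[:, n_obj:]
        return (obj_refined, rel_refined,
                self.obj_head(obj_refined), self.rel_head(rel_refined))


def select_relevant_relations(question_emb, rel_refined, top_k=10):
    """Score each relation against a pooled question embedding and keep the
    top-k triplets. question_emb: (B, D); rel_refined: (B, N_rel, D)."""
    scores = torch.einsum('bd,bnd->bn', question_emb, rel_refined)
    weights = F.softmax(scores, dim=-1)
    top_idx = weights.topk(top_k, dim=-1).indices
    return top_idx, weights
```

In this sketch, the selected triplets (and their weights) would then be fed to a downstream answer-prediction module together with the question representation; that part is omitted, as the abstract does not specify it.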
ISSN: 1520-9210
EISSN: 1941-0077
DOI: 10.1109/TMM.2022.3169065