SGEITL: Scene Graph Enhanced Image-Text Learning for Visual Commonsense Reasoning

Bibliographic Details
Published in: arXiv.org
Main Authors: Wang, Zhecan; You, Haoxuan; Li, Liunian Harold; Zareian, Alireza; Park, Suji; Liang, Yiqing; Chang, Kai-Wei; Chang, Shih-Fu
Format: Paper
Language: English
Published: Ithaca: Cornell University Library, arXiv.org, 16.12.2021

Summary: Answering complex questions about images is an ambitious goal for machine intelligence, which requires a joint understanding of images, text, and commonsense knowledge, as well as a strong reasoning ability. Recently, multimodal Transformers have made great progress in the task of Visual Commonsense Reasoning (VCR) by jointly understanding visual objects and text tokens through layers of cross-modality attention. However, these approaches do not utilize the rich structure of the scene and the interactions between objects, which are essential in answering complex commonsense questions. We propose a Scene Graph Enhanced Image-Text Learning (SGEITL) framework to incorporate visual scene graphs in commonsense reasoning. To exploit the scene graph structure, at the architecture level, we propose a multi-hop graph transformer that regularizes attention interactions among hops. For pre-training, we propose a scene-graph-aware method that leverages structural knowledge extracted from the visual scene graph. Moreover, we introduce a method to train and generate domain-relevant visual scene graphs from textual annotations in a weakly supervised manner. Extensive experiments on VCR and other tasks show a significant performance boost compared with state-of-the-art methods and demonstrate the efficacy of each proposed component.
ISSN: 2331-8422
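
The summary mentions a multi-hop graph transformer that regularizes attention interactions among hops. As a minimal sketch of that general idea only, and not the paper's actual implementation, the code below masks self-attention scores so each scene-graph node attends only to nodes within a fixed number of hops; all identifiers here (hop_distances, HopMaskedSelfAttention, max_hops) are hypothetical.

```python
# Illustrative sketch (not the SGEITL implementation): self-attention over
# scene-graph nodes, masked so each node attends only to nodes within
# `max_hops` edges -- one way to restrict attention interactions by hop count.
import torch
import torch.nn as nn
import torch.nn.functional as F


def hop_distances(adj: torch.Tensor, max_hops: int) -> torch.Tensor:
    """BFS hop counts between nodes; pairs farther apart get max_hops + 1.

    adj: (N, N) binary adjacency matrix of the scene graph.
    """
    n = adj.size(0)
    dist = torch.full((n, n), max_hops + 1, device=adj.device)
    dist.fill_diagonal_(0)
    reached = torch.eye(n, dtype=torch.bool, device=adj.device)
    frontier = reached.clone()
    for hop in range(1, max_hops + 1):
        # Nodes adjacent to the current frontier that were not reached before.
        frontier = (frontier.float() @ adj.float() > 0) & ~reached
        dist[frontier] = hop
        reached |= frontier
    return dist


class HopMaskedSelfAttention(nn.Module):
    """Single-head self-attention restricted to scene-graph neighborhoods."""

    def __init__(self, dim: int, max_hops: int = 2):
        super().__init__()
        self.qkv = nn.Linear(dim, 3 * dim)
        self.max_hops = max_hops
        self.scale = dim ** -0.5

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (N, dim) node features; adj: (N, N) scene-graph adjacency.
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        scores = (q @ k.transpose(0, 1)) * self.scale
        # Disallow attention between nodes more than `max_hops` edges apart.
        mask = hop_distances(adj, self.max_hops) > self.max_hops
        scores = scores.masked_fill(mask, float("-inf"))
        return F.softmax(scores, dim=-1) @ v


# Toy usage: a 4-node chain graph 0-1-2-3 with random node features.
adj = torch.tensor([[0, 1, 0, 0],
                    [1, 0, 1, 0],
                    [0, 1, 0, 1],
                    [0, 0, 1, 0]], dtype=torch.float)
attn = HopMaskedSelfAttention(dim=16, max_hops=2)
out = attn(torch.randn(4, 16), adj)  # (4, 16); node 0 never attends to node 3
```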