Context-Aware Graph Inference With Knowledge Distillation for Visual Dialog

Bibliographic Details
Published in IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, no. 10, pp. 6056-6073
Main Authors Guo, Dan; Wang, Hui; Wang, Meng
Format Journal Article
Language English
Published New York IEEE 01.10.2022
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: Visual dialog is a challenging task that requires comprehending the semantic dependencies among implicit visual and textual contexts. The task can be cast as relational inference in a graphical model with sparse contextual subjects (nodes) and an unknown graph structure (relation descriptor); how to model the underlying context-aware relational inference is critical. To this end, we propose a novel context-aware graph (CAG) neural network. We focus on exploiting fine-grained relational reasoning with object-level dialog-historical co-reference nodes. The graph structure (relations in the dialog) is iteratively updated using an adaptive top-$K$ message passing mechanism. To eliminate sparse, useless relations, each node has dynamic relations in the graph (a different set of $K$ related neighbor nodes), and only the most relevant nodes contribute to the context-aware relational graph inference. In addition, to avoid the performance degradation caused by the linguistic bias of the dialog history, we propose a purely visual-aware knowledge distillation mechanism named CAG-Distill, in which image-only visual clues are used to regularize the joint dialog-historical contextual awareness at the object level. Experimental results on the VisDial v0.9 and v1.0 datasets show that both CAG and CAG-Distill outperform comparative methods. Visualization results further validate the remarkable interpretability of our graph inference solution.
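The top-K message passing idea in the summary can be illustrated with a minimal sketch. This is not the authors' CAG implementation: the function name `top_k_message_passing`, the dot-product relation score, the fixed `k`, and the simple averaging update are all assumptions made for illustration, and the paper's adaptive selection, question and history conditioning, and iterative graph updates are omitted.

```python
import math


def top_k_message_passing(nodes, k):
    """One round of top-K message passing over a list of feature vectors.

    Hypothetical simplification: relation scores are plain dot products,
    and each node becomes the average of its own feature and the
    softmax-weighted sum of its K highest-scoring neighbors.
    """
    n = len(nodes)
    updated = []
    for i in range(n):
        # Score every other node against node i (dot-product similarity).
        scores = [
            (sum(a * b for a, b in zip(nodes[i], nodes[j])), j)
            for j in range(n)
            if j != i
        ]
        # Keep only the K most relevant neighbors; the rest send no message.
        top = sorted(scores, reverse=True)[:k]
        if not top:
            # No neighbors available (single node or k == 0): keep as-is.
            updated.append(list(nodes[i]))
            continue
        # Softmax over the kept scores, shifted by the max for stability.
        peak = max(s for s, _ in top)
        weights = [math.exp(s - peak) for s, _ in top]
        total = sum(weights)
        # Aggregate the selected neighbors' features with those weights.
        message = [
            sum(w / total * nodes[j][d] for w, (_, j) in zip(weights, top))
            for d in range(len(nodes[i]))
        ]
        # Blend the node's own feature with the incoming message.
        updated.append([(x + y) / 2 for x, y in zip(nodes[i], message)])
    return updated
```

Calling `top_k_message_passing(features, k)` repeatedly would loosely mimic an iterative update, since each node can pick different neighbors once the features change. How the paper actually scores relations and adapts the neighbor set is not specified in this record.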
ISSN: 0162-8828, 1939-3539, 2160-9292
DOI: 10.1109/TPAMI.2021.3085755