Context-Aware Graph Inference With Knowledge Distillation for Visual Dialog
| Published in | IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 44, No. 10, pp. 6056–6073 |
|---|---|
| Main Authors | , , |
| Format | Journal Article |
| Language | English |
| Published | New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.10.2022 |
Summary: Visual dialog is a challenging task that requires the comprehension of the semantic dependencies among implicit visual and textual contexts. This task can refer to the relational inference in a graphical model with sparse contextual subjects (nodes) and unknown graph structure (relation descriptor); how to model the underlying context-aware relational inference is critical. To this end, we propose a novel context-aware graph (CAG) neural network. We focus on the exploitation of fine-grained relational reasoning with object-level dialog-historical co-reference nodes. The graph structure (relation in dialog) is iteratively updated using an adaptive top-K message passing mechanism. To eliminate sparse useless relations, each node has dynamic relations in the graph (different related K neighbor nodes), and only the most relevant nodes are attributive to the context-aware relational graph inference. In addition, to avoid negative performance caused by linguistic bias of history, we propose a pure visual-aware knowledge distillation mechanism named CAG-Distill, in which image-only visual clues are used to regularize the joint dialog-historical contextual awareness at the object-level. Experimental results on VisDial v0.9 and v1.0 datasets show that both CAG and CAG-Distill outperform comparative methods. Visualization results further validate the remarkable interpretability of our graph inference solution.
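The two mechanisms named in the abstract — adaptive top-K message passing over graph nodes, and a distillation loss that lets a visual-only teacher regularize the joint model — can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the dot-product affinity, the residual update, and the function names (`top_k_message_passing`, `distill_loss`) are assumptions chosen for clarity.

```python
import numpy as np

def top_k_message_passing(node_feats, k):
    """One round of adaptive top-K message passing (illustrative sketch).

    node_feats: (N, D) array of object/dialog node features. For each node,
    only its k most relevant neighbors (here ranked by a simple dot-product
    affinity, an assumption of this sketch) contribute a message.
    """
    n = node_feats.shape[0]
    # Pairwise affinities: the relation descriptor is unknown, so it is
    # estimated from the node features themselves.
    affinity = node_feats @ node_feats.T
    np.fill_diagonal(affinity, -np.inf)  # no self-messages

    updated = np.empty_like(node_feats)
    for i in range(n):
        # Keep only the top-K neighbors; sparse, useless relations are dropped.
        neighbors = np.argsort(affinity[i])[-k:]
        weights = np.exp(affinity[i, neighbors])
        weights /= weights.sum()  # softmax over the K retained edges
        message = weights @ node_feats[neighbors]
        updated[i] = node_feats[i] + message  # residual node update
    return updated

def distill_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between softened teacher and student distributions,
    a standard form of knowledge distillation (the exact CAG-Distill loss
    may differ)."""
    def softmax(x):
        e = np.exp(x - x.max())
        return e / e.sum()
    p = softmax(teacher_logits / temperature)  # visual-only teacher
    q = softmax(student_logits / temperature)  # joint dialog-history student
    return float(np.sum(p * (np.log(p) - np.log(q))))

rng = np.random.default_rng(0)
feats = rng.standard_normal((6, 4))
out = top_k_message_passing(feats, k=2)
print(out.shape)  # (6, 4)
```

In this sketch the "adaptive" aspect is reduced to re-ranking neighbors each round from the updated features; in the paper the per-node K itself is dynamic.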
| ISSN | 0162-8828, 1939-3539, 2160-9292 |
|---|---|
| DOI | 10.1109/TPAMI.2021.3085755 |