Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation
Published in | IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, no. 11, pp. 3820-3832 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | United States: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.11.2021 |
Summary: | Relations amongst entities play a central role in image understanding. Due to the complexity of modeling (subject, predicate, object) relation triplets, it is crucial to develop a method that can not only recognize seen relations, but also generalize to unseen cases. Inspired by a previously proposed visual translation embedding model, or VTransE [1], we propose a context-augmented translation embedding model that can capture both common and rare relations. The previous VTransE model maps entities and predicates into a low-dimensional embedding vector space where the predicate is interpreted as a translation vector between the embedded features of the bounding box regions of the subject and the object. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings guided by the constraint predicate ≈ union(subject, object) - subject - object. In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets, from common to previously unseen relations. It also achieves promising results for the recently introduced task of scene graph generation. |
---|---|
ISSN: | 0162-8828, 1939-3539, 2160-9292 |
DOI: | 10.1109/TPAMI.2020.2992222 |
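
The constraint stated in the summary, predicate ≈ union(subject, object) - subject - object, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the embedding size, the random vectors standing in for embedded region features, and the negative squared distance used for scoring are all assumptions made here for illustration, since the record does not describe the network or the training loss.

```python
import numpy as np


def context_translation(subj_emb, obj_emb, union_emb):
    """Translation vector implied by the constraint in the summary:
    predicate ~ union(subject, object) - subject - object."""
    return union_emb - subj_emb - obj_emb


def score_predicates(translation, predicate_embs):
    """Score every candidate predicate against the translation vector.

    Assumption: negative squared Euclidean distance is used as the score,
    so a predicate embedding equal to the translation scores highest (0).
    """
    diffs = predicate_embs - translation
    return -np.sum(diffs * diffs, axis=1)


# Hypothetical sizes and random stand-ins for embedded box features.
rng = np.random.default_rng(0)
emb_dim, num_predicates = 8, 5
subj_emb = rng.normal(size=emb_dim)
obj_emb = rng.normal(size=emb_dim)
union_emb = rng.normal(size=emb_dim)

translation = context_translation(subj_emb, obj_emb, union_emb)

# Random predicate embeddings, with index 2 placed exactly on the
# translation so that it satisfies the constraint perfectly.
predicate_embs = rng.normal(size=(num_predicates, emb_dim))
predicate_embs[2] = translation

scores = score_predicates(translation, predicate_embs)
best_predicate = int(np.argmax(scores))
```

In this sketch the predicate whose embedding lies closest to the union-minus-subject-minus-object vector is selected; a trained model would learn the embeddings so that the correct predicate ends up nearest.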