Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, No. 11, pp. 3820-3832
Main Authors: Hung, Zih-Siou; Mallya, Arun; Lazebnik, Svetlana
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2021

Summary: Relations amongst entities play a central role in image understanding. Due to the complexity of modeling (subject, predicate, object) relation triplets, it is crucial to develop a method that can not only recognize seen relations, but also generalize to unseen cases. Inspired by a previously proposed visual translation embedding model, or VTransE [1], we propose a context-augmented translation embedding model that can capture both common and rare relations. The previous VTransE model maps entities and predicates into a low-dimensional embedding vector space where the predicate is interpreted as a translation vector between the embedded features of the bounding box regions of the subject and the object. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings guided by the constraint predicate ≈ union(subject, object) − subject − object. In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets, and from common to previously unseen relations. It also achieves promising results for the recently introduced task of scene graph generation.
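
The constraint in the summary, predicate ≈ union(subject, object) − subject − object, can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the authors' implementation: the class name, the feature and embedding dimensions, the use of separate linear projections, and the softmax classifier over the translation vector are stand-ins chosen to show the shape of the idea.

# Minimal sketch of a context-augmented translation embedding, assuming
# precomputed box features; predicate ≈ f(union) - f(subject) - f(object).
import torch
import torch.nn as nn

class ContextualTranslationEmbedding(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=512, num_predicates=50):
        super().__init__()
        # Separate projections map subject, object, and union-box
        # features into a shared low-dimensional embedding space.
        self.proj_subj = nn.Linear(feat_dim, embed_dim)
        self.proj_obj = nn.Linear(feat_dim, embed_dim)
        self.proj_union = nn.Linear(feat_dim, embed_dim)
        # The resulting translation vector is classified into predicates.
        self.predicate_cls = nn.Linear(embed_dim, num_predicates)

    def forward(self, subj_feat, obj_feat, union_feat):
        # Translation vector: union embedding minus subject and object
        # embeddings, mirroring the constraint from the abstract.
        translation = (self.proj_union(union_feat)
                       - self.proj_subj(subj_feat)
                       - self.proj_obj(obj_feat))
        return self.predicate_cls(translation)

# Usage with dummy features (in practice these would come from an
# object detector's ROI pooling over the three boxes):
model = ContextualTranslationEmbedding()
subj = torch.randn(8, 4096)    # subject-box features, batch of 8
obj = torch.randn(8, 4096)     # object-box features
union = torch.randn(8, 4096)   # union-box (context) features
logits = model(subj, obj, union)        # shape (8, num_predicates)
labels = torch.randint(0, 50, (8,))     # ground-truth predicate labels
loss = nn.functional.cross_entropy(logits, labels)

The union-box projection is what distinguishes this sketch from a plain VTransE-style model, which would instead learn the predicate as a translation between the subject and object embeddings alone.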
ISSN: 0162-8828, 1939-3539, 2160-9292
DOI: 10.1109/TPAMI.2020.2992222