Contextual Translation Embedding for Visual Relationship Detection and Scene Graph Generation

Bibliographic Details
Published in: IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 43, No. 11, pp. 3820-3832
Main Authors: Hung, Zih-Siou; Mallya, Arun; Lazebnik, Svetlana
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2021

Summary: Relations amongst entities play a central role in image understanding. Due to the complexity of modeling (subject, predicate, object) relation triplets, it is crucial to develop a method that can not only recognize seen relations, but also generalize to unseen cases. Inspired by a previously proposed visual translation embedding model, or VTransE [1], we propose a context-augmented translation embedding model that can capture both common and rare relations. The previous VTransE model maps entities and predicates into a low-dimensional embedding vector space where the predicate is interpreted as a translation vector between the embedded features of the bounding box regions of the subject and the object. Our model additionally incorporates the contextual information captured by the bounding box of the union of the subject and the object, and learns the embeddings guided by the constraint predicate ≈ union(subject, object) − subject − object. In a comprehensive evaluation on multiple challenging benchmarks, our approach outperforms previous translation-based models and comes close to or exceeds the state of the art across a range of settings, from small-scale to large-scale datasets, and from common to previously unseen relations. It also achieves promising results for the recently introduced task of scene graph generation.
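
The constraint in the summary, predicate ≈ union(subject, object) − subject − object, can be illustrated with a minimal sketch. Everything below is an illustrative assumption, not the authors' implementation: the class name, the feature and embedding dimensions, the use of separate linear projections, and the softmax classifier over the translation vector are stand-ins chosen to show the shape of the idea.

# Minimal sketch of a context-augmented translation embedding, assuming
# precomputed box features; predicate ≈ f(union) - f(subject) - f(object).
import torch
import torch.nn as nn

class ContextualTranslationEmbedding(nn.Module):
    def __init__(self, feat_dim=4096, embed_dim=512, num_predicates=50):
        super().__init__()
        # Separate projections map subject, object, and union-box
        # features into a shared low-dimensional embedding space.
        self.proj_subj = nn.Linear(feat_dim, embed_dim)
        self.proj_obj = nn.Linear(feat_dim, embed_dim)
        self.proj_union = nn.Linear(feat_dim, embed_dim)
        # The resulting translation vector is classified into predicates.
        self.predicate_cls = nn.Linear(embed_dim, num_predicates)

    def forward(self, subj_feat, obj_feat, union_feat):
        # Translation vector: union embedding minus subject and object
        # embeddings, mirroring the constraint from the abstract.
        translation = (self.proj_union(union_feat)
                       - self.proj_subj(subj_feat)
                       - self.proj_obj(obj_feat))
        return self.predicate_cls(translation)

# Usage with dummy features (in practice these would come from an
# object detector's ROI pooling over the three boxes):
model = ContextualTranslationEmbedding()
subj = torch.randn(8, 4096)    # subject-box features, batch of 8
obj = torch.randn(8, 4096)     # object-box features
union = torch.randn(8, 4096)   # union-box (context) features
logits = model(subj, obj, union)        # shape (8, num_predicates)
labels = torch.randint(0, 50, (8,))     # ground-truth predicate labels
loss = nn.functional.cross_entropy(logits, labels)

The union-box projection is what distinguishes this sketch from a plain VTransE-style model, which would instead learn the predicate as a translation between the subject and object embeddings alone.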
ISSN: 0162-8828, 1939-3539, 2160-9292
DOI: 10.1109/TPAMI.2020.2992222