Union-Redefined Prototype Network for scene graph generation


Bibliographic Details
Published in: Expert Systems with Applications, Vol. 280, p. 127486
Main Authors: Jung, NamGyu; Choi, Chang
Format: Journal Article
Language: English
Published: Elsevier Ltd, 25.06.2025

Summary: Recent advances in scene graph generation employ commonsense knowledge to model visual prototypes for various predicates. However, indiscriminate application of prototypes to all entity pairs within a predicate can lead to confusion when interpreting the same entity pairs that share similar visual features but correspond to different predicates. In particular, clearly distinguishing predicates often requires careful consideration of subtle factors, such as gaze direction, spatial distance, and surrounding context. While these aspects are inherently part of the non-overlapping regions between the subject and object, they are frequently overlooked in practice. In this paper, we propose the Union-Redefined Prototype Network (UP-Net), which effectively captures predicate-specific visual nuances by leveraging the subject and object's positional information and non-overlapping areas. Our method redefines entity pairs by incorporating the often-overlooked union region, excluding the intersection between the subject and object. Furthermore, we enhance the discriminative power among predicates by increasing the divergence between predicate-specific representations of the same entity pairs, thereby capturing the subtle visual nuances associated with each predicate. Extensive experiments on the Visual Genome dataset demonstrate that our approach achieves state-of-the-art performance.
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2025.127486
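
The abstract describes two mechanisms only at a high level: redefining the entity-pair region as the union of the subject and object boxes with their intersection removed, and increasing the divergence between predicate-specific representations of the same entity pair. The sketch below is a minimal, hypothetical illustration of those two ideas in PyTorch, not the authors' implementation; the function names (union_minus_intersection_mask, pairwise_divergence_loss), the mask resolution, and the cosine-based divergence term are all assumptions made for illustration.

# Hypothetical sketch of the two ideas summarized in the abstract, NOT the
# paper's implementation: (1) a spatial mask covering the union of the
# subject/object boxes with their overlap removed, and (2) a simple term
# that pushes apart predicate-specific representations of the same pair.
import torch
import torch.nn.functional as F


def union_minus_intersection_mask(subj_box, obj_box, size=32):
    """Rasterize a (size x size) binary mask of the union region of two
    boxes with their overlap removed. Boxes are (x1, y1, x2, y2) in [0, 1]."""
    ys = torch.linspace(0.0, 1.0, size).view(size, 1).expand(size, size)
    xs = torch.linspace(0.0, 1.0, size).view(1, size).expand(size, size)

    def inside(box):
        x1, y1, x2, y2 = box
        return ((xs >= x1) & (xs <= x2) & (ys >= y1) & (ys <= y2)).float()

    subj, obj = inside(subj_box), inside(obj_box)
    union = torch.clamp(subj + obj, max=1.0)      # subject OR object region
    inter = subj * obj                            # subject AND object region
    return union - inter                          # union with overlap removed


def pairwise_divergence_loss(predicate_reprs):
    """Encourage predicate-specific representations of the SAME entity pair
    to diverge (off-diagonal cosine similarities pushed toward zero).
    predicate_reprs: (P, D) tensor, one row per candidate predicate."""
    normed = F.normalize(predicate_reprs, dim=-1)
    sim = normed @ normed.t()                     # (P, P) cosine similarities
    off_diag = sim - torch.diag(torch.diag(sim))  # drop self-similarity
    return off_diag.abs().mean()


# Toy usage with made-up boxes and random predicate representations.
subj = torch.tensor([0.10, 0.20, 0.60, 0.80])
obj = torch.tensor([0.40, 0.30, 0.90, 0.70])
mask = union_minus_intersection_mask(subj, obj)
reprs = torch.randn(5, 128)                       # 5 candidate predicates
print(mask.sum().item(), pairwise_divergence_loss(reprs).item())

In this toy version the mask could be concatenated with visual features as an extra channel, and the divergence term added to the training loss with a small weight; how UP-Net actually fuses the redefined union region and enforces divergence is specified only in the full paper.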