Multiscale Conditional Relationship Graph Network for Referring Relationships in Images
Published in | IEEE Transactions on Cognitive and Developmental Systems, Vol. 14, no. 2, pp. 752-760 |
Main Authors | , |
Format | Journal Article |
Language | English |
Published | Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2022 |
ISSN | 2379-8920, 2379-8939 |
DOI | 10.1109/TCDS.2021.3079278 |
Summary: | Images contain not only individual entities but also abundant visual relationships between entities. Therefore, conditioned on visual relationship triples <subject-relationship-object> that can be viewed as structured text, entities (subjects or objects) can be localized in images without ambiguity. However, modeling visual relationships efficiently is challenging, since a specific relationship usually exhibits dramatic intraclass visual differences when involving different entities, many of which appear at small scales. In addition, the subject and the object in a relationship triple may have different best scales, and matching the subject and the object at their respective appropriate scales may improve prediction. To address these issues, a multiscale conditional relationship graph network (CRGN) is proposed in this article to localize entities based on visual relationships. Specifically, an attention pyramid network is first introduced to generate multiscale attention maps that capture entities of various sizes for entity matching. A CRGN is then designed to aggregate and refine multiscale attention features to localize entities by passing relationship contexts between entity attention maps, making full use of the entity attention maps at the best scales. To mitigate the negative effects of intraclass visual differences of relationships, vision-agnostic relationship features are utilized in the proposed CRGN to model relationship contexts indirectly. The experiments demonstrate the superiority of the proposed method over previous powerful frameworks on three challenging benchmark data sets: CLEVR, Visual Genome, and VRD. The project page can be found at https://mic.tongji.edu.cn/d9/5c/c9778a186716/page.htm . |
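The abstract outlines two mechanisms: multiscale attention maps produced by an attention pyramid, and the exchange of relationship contexts between the subject and object attention branches through vision-agnostic relationship embeddings. Below is a minimal PyTorch sketch of those two ideas. It is not the authors' implementation: every module name, dimension, and the exact message-passing rule (`CRGNSketch`, `ConditionalAttention`, the pooled-feature message) is an illustrative assumption based only on the abstract.

```python
# Sketch of multiscale conditional attention with relationship-context
# message passing, under the assumptions stated above.
import torch
import torch.nn as nn


class ConditionalAttention(nn.Module):
    """Predict an entity attention map from image features, conditioned
    on an entity embedding concatenated with a relationship context."""

    def __init__(self, feat_dim, cond_dim):
        super().__init__()
        self.proj = nn.Conv2d(feat_dim + cond_dim, 1, kernel_size=1)

    def forward(self, feats, condition):
        # feats: (B, C, H, W); condition: (B, E), broadcast over space.
        b, _, h, w = feats.shape
        cond = condition[:, :, None, None].expand(-1, -1, h, w)
        logits = self.proj(torch.cat([feats, cond], dim=1))
        return torch.sigmoid(logits)  # (B, 1, H, W) attention map


class CRGNSketch(nn.Module):
    def __init__(self, feat_dim=256, embed_dim=128, num_entities=100,
                 num_predicates=70, steps=2):
        super().__init__()
        # Vision-agnostic embeddings for entity and relationship labels.
        self.ent_embed = nn.Embedding(num_entities, embed_dim)
        self.rel_embed = nn.Embedding(num_predicates, embed_dim)
        self.subj_att = ConditionalAttention(feat_dim, 2 * embed_dim)
        self.obj_att = ConditionalAttention(feat_dim, 2 * embed_dim)
        # Message: pooled partner features fused with the relationship.
        self.msg = nn.Linear(feat_dim + embed_dim, embed_dim)
        self.steps = steps

    def forward(self, pyramid, subj_id, obj_id, pred_id):
        # pyramid: list of (B, C, Hi, Wi) feature maps at several scales.
        rel = self.rel_embed(pred_id)
        s_ctx = torch.zeros_like(rel)  # initial relationship contexts
        o_ctx = torch.zeros_like(rel)
        subj_maps, obj_maps = [], []
        for _ in range(self.steps):
            subj_maps, obj_maps = [], []
            for feats in pyramid:  # one attention map per scale
                s_cond = torch.cat([self.ent_embed(subj_id), s_ctx], dim=1)
                o_cond = torch.cat([self.ent_embed(obj_id), o_ctx], dim=1)
                subj_maps.append(self.subj_att(feats, s_cond))
                obj_maps.append(self.obj_att(feats, o_cond))
            # Pool attended features at one scale and exchange contexts
            # through the relationship embedding (not raw appearance).
            f = pyramid[-1]
            s_feat = (f * subj_maps[-1]).mean(dim=(2, 3))
            o_feat = (f * obj_maps[-1]).mean(dim=(2, 3))
            o_ctx = torch.tanh(self.msg(torch.cat([s_feat, rel], dim=1)))
            s_ctx = torch.tanh(self.msg(torch.cat([o_feat, rel], dim=1)))
        return subj_maps, obj_maps


if __name__ == "__main__":
    model = CRGNSketch()
    pyramid = [torch.randn(2, 256, s, s) for s in (32, 16, 8)]
    s_maps, o_maps = model(pyramid, torch.tensor([3, 7]),
                           torch.tensor([5, 2]), torch.tensor([1, 4]))
    print([m.shape for m in s_maps])  # one (2, 1, Hi, Wi) map per scale
```

The design point mirrored here is that the context exchanged between the two branches flows through the relationship embedding rather than raw appearance features, which is how the abstract describes mitigating intraclass visual variation of relationships.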