Multiscale Conditional Relationship Graph Network for Referring Relationships in Images

Bibliographic Details
Published in: IEEE Transactions on Cognitive and Developmental Systems, Vol. 14, No. 2, pp. 752-760
Main Authors: Zhu, Jian; Wang, Hanli
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.06.2022
ISSN: 2379-8920, 2379-8939
DOI: 10.1109/TCDS.2021.3079278

Summary: Images contain not only individual entities but also abundant visual relationships between entities. Therefore, conditioned on visual relationship triples <subject-relationship-object>, which can be viewed as structured texts, entities (subjects or objects) can be localized in images without ambiguity. However, it is challenging to model visual relationships efficiently, since a specific relationship usually exhibits dramatic intraclass visual differences when it involves different entities, many of which appear at small scales. In addition, the subject and the object in a relationship triple may have different best scales, and matching the subject and the object at their respective appropriate scales may improve prediction. To address these issues, a multiscale conditional relationship graph network (CRGN) is proposed in this article to localize entities based on visual relationships. Specifically, an attention pyramid network is first introduced to generate multiscale attention maps that capture entities of various sizes for entity matching. Then, a CRGN is designed to aggregate and refine the multiscale attention features and to localize entities by passing relationship contexts between entity attention maps, making full use of the entity attention maps at the best scales. To mitigate the negative effects of the intraclass visual differences of relationships, vision-agnostic relationship features are utilized in the proposed CRGN to model relationship contexts indirectly. Experiments demonstrate the superiority of the proposed method over previous powerful frameworks on three challenging benchmark data sets: CLEVR, Visual Genome, and VRD. The project page can be found at https://mic.tongji.edu.cn/d9/5c/c9778a186716/page.htm .
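To make the architecture described in the summary more concrete, the following is a minimal, hypothetical PyTorch sketch of its two main ingredients: an attention pyramid that produces entity attention maps at several scales, and one graph-style message-passing step that refines the subject and object maps using a vision-agnostic (text-only) predicate embedding. All module and parameter names here (AttentionPyramid, RelationMessagePassing, scales, num_predicates, embed_dim) are assumptions made for illustration, not the authors' implementation.

    # Illustrative sketch only; not the authors' CRGN implementation.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class AttentionPyramid(nn.Module):
        """Produce coarse-to-fine attention maps for one entity from image features."""
        def __init__(self, in_channels, scales=(8, 16, 32)):
            super().__init__()
            self.scales = scales
            self.heads = nn.ModuleList([nn.Conv2d(in_channels, 1, kernel_size=1)
                                        for _ in scales])

        def forward(self, feat):                                  # feat: (B, C, H, W)
            maps = []
            for size, head in zip(self.scales, self.heads):
                pooled = F.adaptive_avg_pool2d(feat, size)        # resample features to this scale
                maps.append(torch.sigmoid(head(pooled)))          # (B, 1, size, size) attention map
            return maps

    class RelationMessagePassing(nn.Module):
        """One refinement step: pass a relationship-conditioned message between the
        subject and object attention maps. The message depends only on a predicate
        embedding (vision-agnostic), not on entity appearance."""
        def __init__(self, num_predicates, embed_dim=64):
            super().__init__()
            self.pred_embed = nn.Embedding(num_predicates, embed_dim)
            self.to_kernel = nn.Linear(embed_dim, 9)              # predicate -> 3x3 shift kernel

        def forward(self, subj_map, obj_map, predicate_idx):      # maps: (B, 1, S, S)
            k = self.to_kernel(self.pred_embed(predicate_idx))    # (B, 9)
            k = k.view(-1, 1, 3, 3)
            new_maps = []
            for src, dst in ((subj_map, obj_map), (obj_map, subj_map)):
                msg = torch.cat([F.conv2d(src[i:i + 1], k[i:i + 1], padding=1)
                                 for i in range(src.size(0))])    # per-sample predicate conv
                new_maps.append(torch.sigmoid(dst + msg))         # refine the receiving map
            new_obj, new_subj = new_maps
            return new_subj, new_obj

    # Usage sketch (backbone features and predicate indices are placeholders):
    feat = torch.randn(2, 256, 32, 32)
    pyramid = AttentionPyramid(256)
    subj_maps, obj_maps = pyramid(feat), pyramid(feat)
    mp = RelationMessagePassing(num_predicates=70)
    new_subj, new_obj = mp(subj_maps[0], obj_maps[0], torch.tensor([3, 12]))

In this sketch the same message-passing step could be applied at each pyramid scale and the refined maps aggregated, which mirrors the abstract's idea of selecting and combining entity attention maps at their best scales.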