One-Stage Visual Relationship Referring With Transformers and Adaptive Message Passing


Bibliographic Details
Published in: IEEE Transactions on Image Processing, Vol. 32, pp. 190-202
Main Authors: Wang, Hang; Du, Youtian; Zhang, Yabin; Li, Shuai; Zhang, Lei
Format: Journal Article
Language: English
Published: United States, The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2023

More Information
Summary: There exist a variety of visual relationships among entities in an image. Given a relationship query <inline-formula> <tex-math notation="LaTeX">\langle subject, predicate, object \rangle </tex-math></inline-formula>, the task of visual relationship referring (VRR) aims to disambiguate instances of the same entity category and to localize both the subject and object entities in an image. Previous VRR methods can generally be categorized as one-stage or multi-stage. One-stage methods localize a pair of entities directly from the image but suffer from low prediction accuracy; multi-stage methods perform better, but they localize only a single pair of entities indirectly, by first generating a large set of candidate proposals. In this paper, we formulate VRR as an end-to-end bounding box regression problem and propose a novel one-stage approach, called VRR-TAMP, which effectively integrates Transformers with an adaptive message passing mechanism. First, the visual relationship query and the image are separately encoded into basic modality-specific embeddings, which are then fed into a cross-modal Transformer encoder to produce a joint representation. Second, to obtain a representation specific to each entity, we introduce an adaptive message passing mechanism and design an entity-specific information distiller, SR-GMP, a gated message passing (GMP) module that operates on the joint representation learned from a single learnable token. The GMP module adaptively distills the final representation of an entity by incorporating contextual cues from the predicate and the other entity. Experiments on the VRD and Visual Genome datasets demonstrate that our approach significantly outperforms one-stage competitors and achieves results competitive with state-of-the-art multi-stage methods.
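To illustrate the gating idea behind a gated message passing step, the following is a minimal numpy sketch, not the paper's SR-GMP implementation: an entity representation is refined by a candidate message built from contextual cues (the predicate and the other entity), with a learned sigmoid gate deciding, per dimension, how much context to let in. All names, dimensions, and weight shapes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_message_passing(h_entity, h_predicate, h_other, W_msg, W_gate):
    """One gated message-passing step (illustrative sketch).

    A candidate message is computed from the contextual cues, and a
    per-dimension gate (conditioned on the entity and its context)
    adaptively blends that message into the entity representation.
    """
    context = np.concatenate([h_predicate, h_other])   # contextual cues
    message = np.tanh(W_msg @ context)                 # candidate message
    gate = sigmoid(W_gate @ np.concatenate([h_entity, context]))
    return gate * h_entity + (1.0 - gate) * message    # adaptive fusion

# Toy dimensions and randomly initialized "learned" weights (assumptions).
d = 8
W_msg = 0.1 * rng.standard_normal((d, 2 * d))
W_gate = 0.1 * rng.standard_normal((d, 3 * d))
h_subj = rng.standard_normal(d)   # subject entity embedding
h_pred = rng.standard_normal(d)   # predicate embedding
h_obj = rng.standard_normal(d)    # object entity embedding

# Refine the subject representation using predicate + object as context.
h_subj_refined = gated_message_passing(h_subj, h_pred, h_obj, W_msg, W_gate)
print(h_subj_refined.shape)  # (8,)
```

Because the gate is a function of both the entity and its context, the blend is input-dependent: different relationship queries can pull in different amounts of contextual information, which is the sense in which the message passing is "adaptive."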
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
ISSN: 1057-7149; 1941-0042
DOI: 10.1109/TIP.2022.3226624