Rethinking referring relationships from a perspective of mask-level relational reasoning


Bibliographic Details
Published in: Pattern Recognition, Vol. 133, p. 109044
Main Authors: Li, Chengyang; Zhu, Liping; Tian, Gangyi; Hou, Yi; Zhou, Heng
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.01.2023

Summary:
• We rethink the referring relationship (RR) task from the perspective of mask-level relational reasoning, which makes the proposed method more interpretable and extensible.
• We design two modules, Mask Generate and Mask Transfer, which jointly help the model learn richer language priors and multimodal information.
• We introduce an unsupervised image-to-text relational reasoning module that improves the generalization ability of the multimodal model.
• Our method achieves state-of-the-art accuracy on two challenging datasets, VRD and Visual Genome.

Referring relationship aims to localize the subject and object entities in an image according to a text triple <subject, predicate, object>. Previous methods use iterative attention to shift between image regions when modeling the predicate. However, the predicate is sometimes implicit and difficult to represent in the image domain, and modeling it with convolutions alone is overly simple and often inappropriate. Moreover, the relational reasoning information contained in the text itself is not fully exploited. To this end, we rethink referring relationships from a mask-level relational reasoning perspective to improve model interpretability. For text-to-image reasoning, we design the Mask Generate and Mask Transfer modules to fully integrate text priors into the reasoning and prediction of masks. For image-to-text reasoning, we propose an unsupervised triple reconstruction method that guides text-to-image reasoning and improves multimodal generalization. Through bi-directional reasoning between image and text, the proposed method, MRR, follows the full multimodal relational reasoning process. Experiments show that MRR achieves state-of-the-art performance on two referring relationship datasets, VRD and Visual Genome.
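To make the described pipeline concrete, below is a minimal, self-contained sketch of how the abstract's three components could fit together: Mask Generate (text-to-image localization), Mask Transfer (shifting a mask via the predicate), and unsupervised triple reconstruction (image-to-text). This is only an illustration under assumed design choices; all module internals, tensor shapes, the FiLM-style predicate conditioning, and the masked-pooling reconstruction loss are assumptions of this sketch, not the authors' implementation.

import torch
import torch.nn as nn


class MaskGenerate(nn.Module):
    """Predict an entity mask from image features, conditioned on a text embedding."""

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, feat_dim)
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, feats: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W); text: (B, T). Gate the features with the text prior.
        gate = self.text_proj(text)[:, :, None, None]      # (B, C, 1, 1)
        return torch.sigmoid(self.head(feats * gate))      # (B, 1, H, W)


class MaskTransfer(nn.Module):
    """Shift a subject mask toward the object region, conditioned on the predicate."""

    def __init__(self, text_dim: int, hidden: int = 64):
        super().__init__()
        self.pred_proj = nn.Linear(text_dim, 2 * hidden)
        self.conv_in = nn.Conv2d(1, hidden, kernel_size=3, padding=1)
        self.conv_out = nn.Conv2d(hidden, 1, kernel_size=3, padding=1)

    def forward(self, mask: torch.Tensor, predicate: torch.Tensor) -> torch.Tensor:
        # FiLM-style conditioning (an assumption): the predicate embedding
        # scales and shifts the intermediate mask features.
        scale, shift = self.pred_proj(predicate).chunk(2, dim=-1)
        h = torch.relu(self.conv_in(mask))
        h = h * scale[:, :, None, None] + shift[:, :, None, None]
        return torch.sigmoid(self.conv_out(h))


class TripleReconstruction(nn.Module):
    """Image-to-text reasoning: reconstruct a text embedding from the masked region."""

    def __init__(self, feat_dim: int, text_dim: int):
        super().__init__()
        self.to_text = nn.Linear(feat_dim, text_dim)

    def forward(self, feats, mask, target_text):
        # Masked average pooling over (H, W), then project back into text space;
        # the L2 reconstruction error serves as an unsupervised training signal.
        pooled = (feats * mask).sum(dim=(2, 3)) / mask.sum(dim=(2, 3)).clamp(min=1e-6)
        return torch.mean((self.to_text(pooled) - target_text) ** 2)


if __name__ == "__main__":
    B, C, T, H, W = 2, 256, 128, 16, 16
    feats = torch.randn(B, C, H, W)                           # backbone image features
    subj, pred, obj = (torch.randn(B, T) for _ in range(3))   # <subject, predicate, object>
    gen, xfer, recon = MaskGenerate(C, T), MaskTransfer(T), TripleReconstruction(C, T)
    subj_mask = gen(feats, subj)        # text-to-image: localize the subject
    obj_mask = xfer(subj_mask, pred)    # transfer the mask via the predicate
    loss = recon(feats, obj_mask, obj)  # image-to-text: reconstruct the object text
    print(subj_mask.shape, obj_mask.shape, loss.item())

Transferring a mask with a predicate-conditioned convolution is one simple reading of "Mask Transfer", and reconstructing the triple embedding from masked pooled features is one simple reading of the unsupervised reconstruction; the paper itself may realize both differently.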
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2022.109044