Rethinking referring relationships from a perspective of mask-level relational reasoning
Published in: Pattern Recognition, Vol. 133, p. 109044
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.01.2023
Summary:
• We rethink the RR task from the perspective of Mask-level Relational Reasoning, which makes the proposed method more explanatory and extensible.
• We design two modules, Mask Generate and Mask Transfer, which jointly help the model learn more language priors and multimodal information.
• We introduce an unsupervised image-to-text relational reasoning module that improves the generalization ability of the multimodal model.
• Our method achieves state-of-the-art accuracy on two challenging datasets, VRD and Visual Genome.
Abstract: Referring relationship aims at localizing the subject and object entities in an image according to a text triple <subject, predicate, object>. Previous methods use iterative attention to shift between image regions when modeling the predicate. However, the predicate is sometimes implicit and difficult to represent in the image domain, so expressing it with simple convolutional modeling is inappropriate. Moreover, the relational reasoning information in the text itself is not fully utilized. To this end, we rethink referring relationships from a mask-level relational reasoning perspective to improve model interpretability. For text-to-image reasoning, we design Mask Generate and Mask Transfer modules that fully integrate text priors into the reasoning and prediction of masks. For image-to-text reasoning, we propose an unsupervised triple reconstruction method that guides text-to-image reasoning and improves multimodal generalization. Through bi-directional reasoning between image and text, the proposed method, MRR, fully conforms to the multimodal relational reasoning process. Experiments show that MRR achieves state-of-the-art performance on two referring-relationship datasets, VRD and Visual Genome.
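To make the pipeline described in the abstract concrete, here is a minimal PyTorch-style sketch of the bi-directional reasoning loop. Only the component names (Mask Generate, Mask Transfer, triple reconstruction) come from the abstract; every internal design choice, dimension, class, and function below is a hypothetical illustration, not the authors' MRR implementation.

```python
# Hypothetical sketch of MRR-style bi-directional mask-level reasoning.
# Module internals, dimensions, and names (beyond those in the abstract)
# are illustrative assumptions, not the paper's released code.
import torch
import torch.nn as nn


class MaskGenerate(nn.Module):
    """Text-to-image: predict an entity mask from image features gated by a text embedding."""

    def __init__(self, feat_dim=256, text_dim=256):
        super().__init__()
        self.proj = nn.Linear(text_dim, feat_dim)
        self.head = nn.Conv2d(feat_dim, 1, kernel_size=1)

    def forward(self, img_feat, text_emb):
        # img_feat: (B, C, H, W); text_emb: (B, D)
        gate = self.proj(text_emb).unsqueeze(-1).unsqueeze(-1)  # (B, C, 1, 1)
        return torch.sigmoid(self.head(img_feat * gate))        # (B, 1, H, W)


class MaskTransfer(nn.Module):
    """Text-to-image: shift the subject mask toward the object, conditioned on the predicate."""

    def __init__(self, text_dim=256, hidden=8):
        super().__init__()
        self.pred_proj = nn.Linear(text_dim, hidden)
        self.conv = nn.Conv2d(1 + hidden, 1, kernel_size=5, padding=2)

    def forward(self, mask, pred_emb):
        # Broadcast the predicate embedding spatially and fuse it with the mask.
        b, _, h, w = mask.shape
        p = self.pred_proj(pred_emb).unsqueeze(-1).unsqueeze(-1).expand(b, -1, h, w)
        return torch.sigmoid(self.conv(torch.cat([mask, p], dim=1)))


class TripleReconstruct(nn.Module):
    """Image-to-text: recover triple tokens from mask-pooled features (unsupervised signal)."""

    def __init__(self, feat_dim=256, vocab=100):
        super().__init__()
        self.cls = nn.Linear(feat_dim, vocab)

    def forward(self, img_feat, mask):
        pooled = (img_feat * mask).flatten(2).mean(-1)  # (B, C)
        return self.cls(pooled)                         # logits over a token vocabulary


img_feat = torch.randn(2, 256, 32, 32)                  # backbone features
subj_emb, pred_emb = torch.randn(2, 256), torch.randn(2, 256)
gen, transfer, recon = MaskGenerate(), MaskTransfer(), TripleReconstruct()
subj_mask = gen(img_feat, subj_emb)                     # Mask Generate
obj_mask = transfer(subj_mask, pred_emb)                # Mask Transfer
subj_logits = recon(img_feat, subj_mask)                # triple reconstruction
```

The point of the sketch is the data flow, not the specific layers: the predicate acts on masks rather than on raw image features, and the reconstruction branch closes the loop from image back to text, which is what the abstract means by bi-directional reasoning.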
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2022.109044