Visual-Linguistic Alignment and Composition for Image Retrieval with Text Feedback
| Published in | 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 108 - 113 |
|---|---|
| Main Authors | |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.07.2023 |
| Summary | In this paper, we focus on the task of image retrieval with text feedback, which involves two key challenges. One is the misalignment problem between different modalities, and the other is to selectively alter the corresponding attributes of the reference image according to the textual words. To this end, we propose a novel visual-linguistic alignment and composition network (ACNet) consisting of two key components: the modality alignment module (MAM) and the relation composition module (RCM). Specifically, the MAM performs alignment between the features from different modalities by applying an image-text contrastive loss. The RCM correlates the image regions with their corresponding words and then adaptively modifies the specific regions of the reference image conditioned on the textual semantics. Quantitative and qualitative experiments on three datasets not only demonstrate that our ACNet outperforms state-of-the-art models, but also verify the effectiveness of our method. |
| ISSN | 1945-788X |
| DOI | 10.1109/ICME55011.2023.00027 |
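
The abstract describes two components: a modality alignment module trained with an image-text contrastive loss, and a relation composition module that correlates image regions with words and selectively rewrites the related regions. The PyTorch sketch below illustrates those two ideas in generic form; all class names, dimensions, the gating scheme, and the cross-attention formulation are assumptions for illustration and are not taken from the ACNet paper itself.

```python
# Minimal illustrative sketch (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over paired image/text
    embeddings, i.e. the kind of image-text contrastive objective the
    abstract attributes to the modality alignment module (MAM)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i2t + loss_t2i) / 2


class RegionWordComposer(nn.Module):
    """Hypothetical stand-in for the relation composition module (RCM):
    cross-attention lets each image region attend to the feedback words,
    and a learned gate decides how strongly each region is rewritten."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, region_feats, word_feats):
        # region_feats: (B, R, D) reference-image region features
        # word_feats:   (B, W, D) token features of the text feedback
        attended, _ = self.cross_attn(region_feats, word_feats, word_feats)
        g = self.gate(torch.cat([region_feats, attended], dim=-1))
        # Regions strongly related to the text are altered; the rest are kept.
        return g * attended + (1 - g) * region_feats


if __name__ == "__main__":
    B, R, W, D = 4, 49, 12, 512
    regions = torch.randn(B, R, D)
    words = torch.randn(B, W, D)
    composed = RegionWordComposer(dim=D)(regions, words)   # (B, R, D)
    loss = image_text_contrastive_loss(composed.mean(1), words.mean(1))
    print(composed.shape, loss.item())
```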