Visual-Linguistic Alignment and Composition for Image Retrieval with Text Feedback

Bibliographic Details
Published in: 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 108 - 113
Main Authors: Li, Dafeng; Zhu, Yingying
Format: Conference Proceeding
Language: English
Published: IEEE, 01.07.2023
Summary: In this paper, we focus on the task of image retrieval with text feedback, which presents two key challenges. One is the misalignment problem between different modalities; the other is selectively altering the corresponding attributes of the reference image according to the textual words. To this end, we propose a novel visual-linguistic alignment and composition network (ACNet) consisting of two key components: the modality alignment module (MAM) and the relation composition module (RCM). Specifically, the MAM aligns features from different modalities by applying an image-text contrastive loss. The RCM correlates image regions with their corresponding words and then adaptively modifies specific regions of the reference image conditioned on the textual semantics. Quantitative and qualitative experiments on three datasets demonstrate that our ACNet outperforms state-of-the-art models and verify the effectiveness of our method.
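The abstract states that the MAM aligns modalities with an image-text contrastive loss. The paper's exact formulation is not given in this record, but such losses are typically symmetric InfoNCE objectives over a batch of matched image/text embedding pairs; the sketch below (NumPy, with an assumed `temperature` hyperparameter) illustrates the general idea, not the authors' specific implementation.

```python
import numpy as np

def image_text_contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE-style image-text contrastive loss (generic sketch).

    img_feats, txt_feats: (N, D) arrays where row i of each is a matched pair.
    Matched pairs are pulled together; mismatched pairs in the batch are
    pushed apart, which is the alignment effect the MAM relies on.
    """
    # L2-normalize so dot products are cosine similarities
    img = img_feats / np.linalg.norm(img_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)

    logits = img @ txt.T / temperature      # (N, N); matched pairs on the diagonal
    labels = np.arange(len(logits))

    def cross_entropy(l, y):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[np.arange(len(y)), y].mean()

    # Average the image-to-text and text-to-image directions
    return 0.5 * (cross_entropy(logits, labels) + cross_entropy(logits.T, labels))
```

As a sanity check, a batch whose image and text embeddings coincide row-for-row yields a much lower loss than the same batch with the pairing scrambled.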
ISSN: 1945-788X
DOI: 10.1109/ICME55011.2023.00027