Visual-Linguistic Alignment and Composition for Image Retrieval with Text Feedback
| Published in | 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 108 - 113 |
|---|---|
| Main Authors | |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.07.2023 |
| Summary | In this paper, we focus on the task of image retrieval with text feedback, which involves two key challenges. One is the misalignment problem between different modalities, and the other is to selectively alter the corresponding attributes of the reference image according to the textual words. To this end, we propose a novel visual-linguistic alignment and composition network (ACNet) consisting of two key components: the modality alignment module (MAM) and the relation composition module (RCM). Specifically, the MAM performs alignment between the features from different modalities by applying an image-text contrastive loss. The RCM correlates the image regions with their corresponding words and then adaptively modifies the specific regions of the reference image conditioned on the textual semantics. Quantitative and qualitative experiments on three datasets not only demonstrate that our ACNet outperforms state-of-the-art models, but also verify the effectiveness of our method. |
| ISSN | 1945-788X |
| DOI | 10.1109/ICME55011.2023.00027 |
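
The abstract describes two components: a modality alignment module trained with an image-text contrastive loss, and a relation composition module that correlates image regions with words and selectively rewrites the related regions. The PyTorch sketch below illustrates those two ideas in generic form; all class names, dimensions, the gating scheme, and the cross-attention formulation are assumptions for illustration and are not taken from the ACNet paper itself.

```python
# Minimal illustrative sketch (not the authors' implementation).
import torch
import torch.nn as nn
import torch.nn.functional as F


def image_text_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE-style contrastive loss over paired image/text
    embeddings, i.e. the kind of image-text contrastive objective the
    abstract attributes to the modality alignment module (MAM)."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    loss_i2t = F.cross_entropy(logits, targets)           # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)       # text -> image direction
    return (loss_i2t + loss_t2i) / 2


class RegionWordComposer(nn.Module):
    """Hypothetical stand-in for the relation composition module (RCM):
    cross-attention lets each image region attend to the feedback words,
    and a learned gate decides how strongly each region is rewritten."""

    def __init__(self, dim=512, num_heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, region_feats, word_feats):
        # region_feats: (B, R, D) reference-image region features
        # word_feats:   (B, W, D) token features of the text feedback
        attended, _ = self.cross_attn(region_feats, word_feats, word_feats)
        g = self.gate(torch.cat([region_feats, attended], dim=-1))
        # Regions strongly related to the text are altered; the rest are kept.
        return g * attended + (1 - g) * region_feats


if __name__ == "__main__":
    B, R, W, D = 4, 49, 12, 512
    regions = torch.randn(B, R, D)
    words = torch.randn(B, W, D)
    composed = RegionWordComposer(dim=D)(regions, words)   # (B, R, D)
    loss = image_text_contrastive_loss(composed.mean(1), words.mean(1))
    print(composed.shape, loss.item())
```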