Global-aware Fragment Representation Aggregation Network for image–text retrieval
Published in | Pattern Recognition, Vol. 159, p. 111085 |
Format | Journal Article |
Language | English |
Published | Elsevier Ltd, 01.03.2025 |
Summary: | Image–text retrieval is an important kind of cross-modal retrieval and has recently attracted much attention. Existing image–text retrieval methods often ignore the relative importance of each fragment (a region in an image or a word in a sentence) to the global semantics of the image or text when aggregating fragment features, which weakens the learned image and text representations. To address this problem, we propose an image–text retrieval method named Global-aware Fragment Representation Aggregation Network (GFRAN). Specifically, it first designs a fine-grained multimodal information interaction module based on the self-attention mechanism to model both the intra-modality and inter-modality relationships between image regions and words. Then, guided by the global image or text feature, it aggregates image or text fragment features according to their attention weights over the global feature, highlighting fragments that contribute more to the overall semantics of images and texts. Extensive experiments on two benchmark datasets, Flickr30K and MS-COCO, demonstrate the superiority of the proposed GFRAN model over several state-of-the-art baselines. |
•A novel image–text retrieval network, GFRAN, is proposed.
•A global-aware aggregation module is proposed to highlight fragment features.
•An interaction module is proposed to capture multi-modality correlations.
•Extensive experimental results demonstrate its superior performance. |
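The global-aware aggregation step described in the summary can be illustrated with a small sketch. The snippet below is a hedged approximation based only on the abstract, not the authors' implementation: it assumes fragment features (image regions or words) are pooled with attention weights computed against a global feature vector, so fragments more aligned with the overall semantics receive larger weights. The tensor shapes, learned projections, scaling factor, and the name GlobalAwareAggregation are illustrative assumptions.

```python
# Minimal sketch (PyTorch) of global-guided attention pooling: fragment features
# are weighted by their attention over a global feature, then summed.
# Shapes, names, and projections are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GlobalAwareAggregation(nn.Module):
    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)  # projects the global feature (query)
        self.key = nn.Linear(dim, dim)    # projects fragment features (keys)

    def forward(self, fragments: torch.Tensor, global_feat: torch.Tensor) -> torch.Tensor:
        # fragments: (batch, n_fragments, dim) -- region or word features
        # global_feat: (batch, dim) -- e.g. a mean-pooled global representation
        q = self.query(global_feat).unsqueeze(1)          # (batch, 1, dim)
        k = self.key(fragments)                           # (batch, n, dim)
        scores = (q * k).sum(-1) / k.size(-1) ** 0.5      # (batch, n)
        weights = F.softmax(scores, dim=-1)               # attention over fragments
        return (weights.unsqueeze(-1) * fragments).sum(1) # (batch, dim) aggregated

# Usage: aggregate 36 region features of dimension 1024 into one image embedding.
agg = GlobalAwareAggregation(1024)
regions = torch.randn(2, 36, 1024)
global_feat = regions.mean(dim=1)
image_emb = agg(regions, global_feat)  # (2, 1024)
```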
ISSN: | 0031-3203 |
DOI: | 10.1016/j.patcog.2024.111085 |