Global-aware Fragment Representation Aggregation Network for image–text retrieval


Bibliographic Details
Published in: Pattern Recognition, Vol. 159, p. 111085
Main Authors: Wang, Di; Tian, Jiabo; Liang, Xiao; Tian, Yumin; He, Lihuo
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.03.2025

More Information
Summary: Image–text retrieval is an important kind of cross-modal retrieval and has recently attracted much attention. Existing image–text retrieval methods often ignore the relative importance of each fragment (a region in an image or a word in a sentence) to the global semantics of the image or text when aggregating fragment features, which weakens the learned image and text representations. To address this problem, we propose an image–text retrieval method named Global-aware Fragment Representation Aggregation Network (GFRAN). Specifically, it first designs a fine-grained multimodal information interaction module based on the self-attention mechanism to model both the intra-modality and inter-modality relationships between image regions and words. Then, guided by the global image or text feature, it aggregates image or text fragment features according to their attention weights over the global feature, highlighting fragments that contribute more to the overall semantics of images and texts. Extensive experiments on two benchmark datasets, Flickr30K and MS-COCO, demonstrate the superiority of the proposed GFRAN model over several state-of-the-art baselines.
•A novel image–text retrieval network, GFRAN, is proposed.
•A global-aware aggregation module is proposed to highlight fragment features.
•An interaction module is proposed to capture multi-modality correlations.
•Extensive experimental results demonstrate its superior performance.
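The aggregation idea in the abstract can be made concrete with a small sketch. The code below is an illustrative interpretation under stated assumptions, not the authors' released GFRAN implementation: joint self-attention over concatenated region and word tokens stands in for the fine-grained interaction module, a mean-pooled fragment feature stands in for the global image/text feature, and all class names, dimensions, and the scaled dot-product scoring are choices made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FragmentInteraction(nn.Module):
    """Joint self-attention over concatenated region and word tokens,
    capturing intra- and inter-modality relationships in one pass."""
    def __init__(self, dim, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, regions, words):
        tokens = torch.cat([regions, words], dim=1)    # (B, n_regions + n_words, D)
        out, _ = self.attn(tokens, tokens, tokens)
        tokens = self.norm(tokens + out)               # residual + layer norm
        n_r = regions.size(1)
        return tokens[:, :n_r], tokens[:, n_r:]        # split back per modality

class GlobalAwareAggregation(nn.Module):
    """Pool fragment features weighted by their attention over a global
    feature, so fragments that contribute more to the overall semantics
    dominate the final embedding."""
    def __init__(self, dim):
        super().__init__()
        self.q = nn.Linear(dim, dim)   # projects the global feature
        self.k = nn.Linear(dim, dim)   # projects fragment features

    def forward(self, fragments, global_feat):
        q = self.q(global_feat).unsqueeze(1)                  # (B, 1, D)
        k = self.k(fragments)                                 # (B, N, D)
        scores = (q * k).sum(-1) / fragments.size(-1) ** 0.5  # (B, N)
        weights = F.softmax(scores, dim=-1)                   # per-fragment importance
        pooled = (weights.unsqueeze(-1) * fragments).sum(1)   # (B, D)
        return F.normalize(pooled, dim=-1)                    # unit norm for cosine matching

# Toy usage: 36 detected regions and 12 words with 512-d features;
# mean pooling stands in for the global feature of each modality.
regions, words = torch.randn(2, 36, 512), torch.randn(2, 12, 512)
r, w = FragmentInteraction(512)(regions, words)
img_emb = GlobalAwareAggregation(512)(r, r.mean(dim=1))
txt_emb = GlobalAwareAggregation(512)(w, w.mean(dim=1))
similarity = (img_emb * txt_emb).sum(-1)   # cosine similarity per image-text pair
```

The design point the abstract emphasizes is the softmax over fragment scores against the global feature: fragments weakly related to the overall semantics receive small weights and contribute little to the pooled embedding, rather than being averaged in uniformly.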
ISSN: 0031-3203
DOI: 10.1016/j.patcog.2024.111085