Dual-Level Representation Enhancement on Characteristic and Context for Image-Text Retrieval

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 32, No. 11, pp. 8037-8050
Main Authors: Yang, Song; Li, Qiang; Li, Wenhui; Li, Xuanya; Liu, An-An
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2022
Summary: Image-text retrieval is a fundamental and vital task in multimedia retrieval and has received growing attention since it connects heterogeneous data. Previous methods that perform well on image-text retrieval mainly focus on the interaction between image regions and text words. However, these approaches lack joint exploration of the characteristics and contexts of regions and words, which causes semantic confusion between similar objects and a loss of contextual understanding. To address these issues, a dual-level representation enhancement network (DREN) is proposed to strengthen the characteristic and contextual representations through innovative block-level and instance-level representation enhancement modules, respectively. The block-level module mines the potential relations between multiple blocks within each instance representation, while the instance-level module learns the contextual relations between different instances. To facilitate accurate matching of image-text pairs, we propose graph correlation inference and weighted adaptive filtering to conduct local and global matching between image-text pairs. Extensive experiments on two challenging datasets (i.e., Flickr30K and MSCOCO) verify the superiority of our method for image-text retrieval.
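To make the two enhancement levels in the abstract concrete, the following is a minimal PyTorch sketch in which both modules are realized with self-attention. The module names (BlockLevelEnhancement, InstanceLevelEnhancement), dimensions, and attention-based formulation are illustrative assumptions, not the paper's actual DREN implementation.

    import torch
    import torch.nn as nn

    class BlockLevelEnhancement(nn.Module):
        # Mines relations between blocks *within* each instance representation:
        # each dim-dimensional feature vector is split into num_blocks blocks,
        # and self-attention over those blocks refines it (with a residual add).
        def __init__(self, dim, num_blocks=8):
            super().__init__()
            assert dim % num_blocks == 0
            self.num_blocks = num_blocks
            self.attn = nn.MultiheadAttention(dim // num_blocks, num_heads=1,
                                              batch_first=True)

        def forward(self, x):  # x: (batch, num_instances, dim)
            b, n, d = x.shape
            blocks = x.reshape(b * n, self.num_blocks, d // self.num_blocks)
            refined, _ = self.attn(blocks, blocks, blocks)
            return (blocks + refined).reshape(b, n, d)

    class InstanceLevelEnhancement(nn.Module):
        # Learns contextual relations *between* instances (regions or words)
        # via self-attention across the instance axis, again with a residual.
        def __init__(self, dim, num_heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

        def forward(self, x):  # x: (batch, num_instances, dim)
            refined, _ = self.attn(x, x, x)
            return x + refined

    # Usage: enhance 36 region features (dim 1024) for a batch of 2 images.
    regions = torch.randn(2, 36, 1024)
    out = InstanceLevelEnhancement(1024)(BlockLevelEnhancement(1024)(regions))
    print(out.shape)  # torch.Size([2, 36, 1024])

Word features on the text side would presumably be enhanced analogously before the local (graph correlation inference) and global (weighted adaptive filtering) matching steps described above.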
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2022.3182426