Dual-path Rare Content Enhancement Network for Image and Text Matching

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, No. 10, p. 1
Main Authors: Wang, Yan; Su, Yuting; Li, Wenhui; Xiao, Jun; Li, Xuanya; Liu, An-An
Format: Journal Article
Language: English
Published: New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.10.2023
Summary: Image and text matching plays a crucial role in bridging the cross-modal gap between vision and language and has made great progress thanks to deep learning. However, existing methods still suffer from the long-tail problem: only a small proportion of content carries highly frequent semantics, while a long tail consists of rare semantics. In this paper, we propose a novel Dual-path Rare Content Enhancement Network (DRCE) to tackle the long-tail issue. Specifically, Cross-modal Representation Enhancement (CRE) and Cross-modal Association Enhancement (CAE) form a dual-path structure that enhances the representation of rare content and its cross-modal association with the benefit of cross-modal prior knowledge. This structure effectively exploits complementary cross-modal relations from different aspects and fuses them adaptively through the proposed Adaptive Fusion Strategy (AFS). Moreover, we propose an alternative re-ranking strategy (ARR) that explores reciprocal contextual information to refine image-text matching results, further suppressing the negative impact of the long-tail effect. Extensive experiments on two large-scale datasets show significant improvements and validate the superiority of our method.
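The abstract does not detail how the two paths are combined, but a common realization of such adaptive fusion is a learned gate that blends the per-pair similarity scores produced by each path. The sketch below illustrates this idea under that assumption; the class name AdaptiveFusion, the gate architecture, and the convex-blend form are illustrative choices, not the paper's actual AFS formulation.

```python
# Minimal sketch of gated fusion of two matching paths (hypothetical AFS-style
# fusion); the abstract does not specify the real formulation.
import torch
import torch.nn as nn

class AdaptiveFusion(nn.Module):
    """Fuse similarity scores from a representation path (CRE) and an
    association path (CAE) with a per-pair learned gate."""
    def __init__(self, hidden_dim: int = 16):
        super().__init__()
        # Gate network: maps the pair of path scores to a weight in (0, 1).
        self.gate = nn.Sequential(
            nn.Linear(2, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
            nn.Sigmoid(),
        )

    def forward(self, sim_cre: torch.Tensor, sim_cae: torch.Tensor) -> torch.Tensor:
        # sim_cre, sim_cae: (num_images, num_texts) similarity matrices.
        pair = torch.stack([sim_cre, sim_cae], dim=-1)   # (I, T, 2)
        w = self.gate(pair).squeeze(-1)                  # (I, T), one gate per pair
        return w * sim_cre + (1.0 - w) * sim_cae         # adaptive convex blend

if __name__ == "__main__":
    afs = AdaptiveFusion()
    s_cre, s_cae = torch.rand(4, 5), torch.rand(4, 5)
    fused = afs(s_cre, s_cae)
    print(fused.shape)  # torch.Size([4, 5])
```

A per-pair gate of this kind lets the model lean on whichever path is more reliable for a given image-text pair, which is one plausible way to favor the association path when rare content makes the representation path less trustworthy.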
ISSN: 1051-8215; 1558-2205
DOI: 10.1109/TCSVT.2023.3254530