Multi-Task Visual Semantic Embedding Network for Image-Text Retrieval
| Published in | Journal of Computer Science and Technology, Vol. 39, No. 4, pp. 811-826 |
| --- | --- |
| Main Authors | , , , , , |
| Format | Journal Article |
| Language | English |
| Published | Singapore: Springer Nature Singapore (Springer Nature B.V.), 01.07.2024 |
Summary: Image-text retrieval aims to capture the semantic correspondence between images and texts, which serves as a foundation and crucial component in multi-modal recommendation, search systems, and online shopping. Existing mainstream methods primarily focus on modeling the association of image-text pairs while neglecting the advantageous impact of multi-task learning on image-text retrieval. To this end, a multi-task visual semantic embedding network (MVSEN) is proposed for image-text retrieval. Specifically, we design two auxiliary tasks, text-text matching and multi-label classification, as semantic constraints to improve the generalization and robustness of visual semantic embedding from a training perspective. Besides, we present an intra- and inter-modality interaction scheme that learns discriminative visual and textual feature representations by facilitating information flow within and between modalities. Subsequently, we utilize multi-layer graph convolutional networks in a cascading manner to infer the correlation of image-text pairs. Experimental results show that MVSEN outperforms state-of-the-art methods on two publicly available datasets, Flickr30K and MS-COCO, with rSum improvements of 8.2% and 3.0%, respectively.
ISSN: 1000-9000; 1860-4749
DOI: 10.1007/s11390-024-4125-1
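To make the multi-task training idea in the summary concrete, below is a minimal PyTorch-style sketch of how a bidirectional image-text ranking loss can be combined with the two auxiliary losses the abstract names: text-text matching and multi-label classification. The class name, hinge formulation, loss weights, and tensor shapes are illustrative assumptions, not the paper's exact MVSEN formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTaskVSELoss(nn.Module):
    """Sketch of a multi-task visual semantic embedding objective:
    a bidirectional hinge ranking loss for image-text matching plus two
    auxiliary losses (text-text matching and multi-label classification).
    Hypothetical formulation for illustration only."""

    def __init__(self, margin: float = 0.2, w_tt: float = 1.0, w_cls: float = 1.0):
        super().__init__()
        self.margin = margin  # hinge margin for the ranking losses
        self.w_tt = w_tt      # weight of the text-text auxiliary loss
        self.w_cls = w_cls    # weight of the classification auxiliary loss

    def hinge_ranking(self, a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
        # Cosine similarities between all pairs in the batch; the diagonal
        # holds the matched (positive) pairs.
        sim = F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()
        pos = sim.diag().unsqueeze(1)
        # Penalize negatives whose similarity comes within `margin`
        # of the corresponding positive pair.
        cost = (self.margin + sim - pos).clamp(min=0)
        cost.fill_diagonal_(0)
        return cost.mean()

    def forward(self, img_emb, txt_emb, txt_emb2, cls_logits, labels):
        # Main task: image-to-text and text-to-image ranking.
        l_match = (self.hinge_ranking(img_emb, txt_emb)
                   + self.hinge_ranking(txt_emb, img_emb))
        # Auxiliary task 1: match two texts describing the same image.
        l_tt = self.hinge_ranking(txt_emb, txt_emb2)
        # Auxiliary task 2: multi-label classification over visual concepts.
        l_cls = F.binary_cross_entropy_with_logits(cls_logits, labels)
        return l_match + self.w_tt * l_tt + self.w_cls * l_cls


# Usage with random tensors: batch of 32, 1024-d embeddings, 80 concept labels.
loss_fn = MultiTaskVSELoss()
B, D, C = 32, 1024, 80
loss = loss_fn(torch.randn(B, D), torch.randn(B, D), torch.randn(B, D),
               torch.randn(B, C), torch.randint(0, 2, (B, C)).float())
```

The point of the sketch is that the auxiliary losses act purely as training-time constraints on the shared embedding space: at inference, only the image-text similarity is used for retrieval.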