Feature Fusion Based on Transformer for Cross-modal Retrieval

Bibliographic Details
Published in: Journal of Physics: Conference Series, Vol. 2558, No. 1, pp. 12012–12017
Main Authors: Zhang, Guihao; Cao, Jiangzhong
Format: Journal Article
Language: English
Published: Bristol, IOP Publishing, 01.08.2023
Summary: With the popularity of the Internet and the rapid growth of multimodal data, multimodal retrieval has gradually become an active area of research. As an important branch of multimodal retrieval, image-text retrieval aims to design a model that learns and aligns data from two modalities, image and text, building a bridge of semantic association between these heterogeneous data so that they can be aligned and retrieved in a unified way. Current mainstream image-text cross-modal retrieval approaches have made good progress by designing deep learning-based models that uncover latent associations between data of different modalities. In this paper, we design a transformer-based feature fusion network that fuses the information of the two modalities during feature extraction, enriching the semantic connections between them. We conduct experiments on the benchmark dataset Flickr30k and obtain competitive results, with recall at 10 reaching 96.2% in image-to-text retrieval.
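
The record does not include the paper's code; purely as an illustration of the idea described in the abstract, a minimal sketch of a transformer-style cross-attention block that fuses image and text features could look as follows (PyTorch). All names, dimensions, and design choices here (FusionBlock, dim=512, mean pooling) are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Hypothetical cross-modal fusion block: one modality's features attend to the other's."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # Inject context-modality information into the query modality, transformer-style.
        attn_out, _ = self.cross_attn(query_feats, context_feats, context_feats)
        x = self.norm1(query_feats + attn_out)
        return self.norm2(x + self.ffn(x))

# Usage sketch: fuse each modality with the other, pool to one vector per sample,
# and rank candidates by similarity for image-to-text / text-to-image retrieval.
img = torch.randn(2, 36, 512)              # e.g. 36 region features per image
txt = torch.randn(2, 20, 512)              # e.g. 20 token features per caption
fusion = FusionBlock()
img_emb = fusion(img, txt).mean(dim=1)     # (2, 512) fused image embeddings
txt_emb = fusion(txt, img).mean(dim=1)     # (2, 512) fused text embeddings
scores = img_emb @ txt_emb.t()             # similarity matrix used for ranking

A metric such as recall at 10 would then be computed by checking, for each query, whether its matching item appears among the 10 highest-scoring candidates.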
ISSN: 1742-6588
EISSN: 1742-6596
DOI: 10.1088/1742-6596/2558/1/012012