Feature Fusion Based on Transformer for Cross-modal Retrieval

Bibliographic Details
Published in: Journal of Physics: Conference Series, Vol. 2558, No. 1, pp. 12012–12017
Main Authors: Zhang, Guihao; Cao, Jiangzhong
Format: Journal Article
Language: English
Published: Bristol, IOP Publishing, 01.08.2023
Summary: With the popularity of the Internet and the rapid growth of multimodal data, multimodal retrieval has gradually become an active area of research. As an important branch of multimodal retrieval, image-text retrieval aims to design a model that learns and aligns data from two modalities, image and text, building a bridge of semantic association between these heterogeneous data so that they can be aligned and retrieved in a unified way. Current mainstream image-text cross-modal retrieval approaches have made good progress by designing deep learning-based models that uncover latent associations between data of different modalities. In this paper, we design a transformer-based feature fusion network that fuses the information of the two modalities during feature extraction, enriching the semantic connections between them. We conduct experiments on the benchmark dataset Flickr30k and obtain competitive results, with recall at 10 reaching 96.2% in image-to-text retrieval.
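
The record does not include the paper's code; purely as an illustration of the idea described in the abstract, a minimal sketch of a transformer-style cross-attention block that fuses image and text features could look as follows (PyTorch). All names, dimensions, and design choices here (FusionBlock, dim=512, mean pooling) are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn as nn

class FusionBlock(nn.Module):
    """Hypothetical cross-modal fusion block: one modality's features attend to the other's."""
    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, query_feats, context_feats):
        # Inject context-modality information into the query modality, transformer-style.
        attn_out, _ = self.cross_attn(query_feats, context_feats, context_feats)
        x = self.norm1(query_feats + attn_out)
        return self.norm2(x + self.ffn(x))

# Usage sketch: fuse each modality with the other, pool to one vector per sample,
# and rank candidates by similarity for image-to-text / text-to-image retrieval.
img = torch.randn(2, 36, 512)              # e.g. 36 region features per image
txt = torch.randn(2, 20, 512)              # e.g. 20 token features per caption
fusion = FusionBlock()
img_emb = fusion(img, txt).mean(dim=1)     # (2, 512) fused image embeddings
txt_emb = fusion(txt, img).mean(dim=1)     # (2, 512) fused text embeddings
scores = img_emb @ txt_emb.t()             # similarity matrix used for ranking

A metric such as recall at 10 would then be computed by checking, for each query, whether its matching item appears among the 10 highest-scoring candidates.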
ISSN: 1742-6588
EISSN: 1742-6596
DOI: 10.1088/1742-6596/2558/1/012012