Cross on Cross Attention: Deep Fusion Transformer for Image Captioning

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, no. 8, p. 1
Main Authors: Zhang, Jing; Xie, Yingshuai; Ding, Weichao; Wang, Zhe
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.08.2023
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2023.3243725

More Information
Summary: Numerous studies have shown that in-depth mining of the correlations between multi-modal features can improve the accuracy of cross-modal data analysis tasks. However, current image captioning methods based on the encoder-decoder framework perform the interaction and fusion of multi-modal features only in the encoding stage or only in the decoding stage, which cannot effectively alleviate the semantic gap. In this paper, we propose a Deep Fusion Transformer (DFT) for image captioning that provides a deep multi-feature and multi-modal information fusion strategy throughout the encoding and decoding process. We propose a novel global cross encoder to align different types of visual features, which effectively compensates for the differences between features and lets them complement one another. In the decoder, a novel cross on cross attention is proposed to realize hierarchical cross-modal data analysis, extending complex cross-modal reasoning capabilities through the multi-level interaction of visual and semantic features. Extensive experiments on the MSCOCO dataset show that the proposed DFT achieves excellent performance and outperforms state-of-the-art methods. The code is available at https://github.com/weimingboya/DFT.
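For readers unfamiliar with the cross-attention operation that the summary builds on, the short PyTorch sketch below illustrates the generic idea of one modality (e.g., semantic/word features) attending over another (e.g., visual features). It is a minimal illustration, not the authors' DFT implementation from the repository above; the class name, dimensions, and the residual-plus-norm arrangement are illustrative assumptions.

    # Minimal cross-attention sketch (illustrative only, not the authors' code):
    # queries from one modality attend over keys/values from another modality.
    import torch
    import torch.nn as nn

    class CrossAttention(nn.Module):
        def __init__(self, d_model: int = 512, num_heads: int = 8):
            super().__init__()
            # Multi-head attention; queries and context may come from different modalities.
            self.attn = nn.MultiheadAttention(d_model, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(d_model)

        def forward(self, queries: torch.Tensor, context: torch.Tensor) -> torch.Tensor:
            # queries: (batch, N_q, d_model), e.g. semantic/word features
            # context: (batch, N_k, d_model), e.g. grid or region visual features
            out, _ = self.attn(queries, context, context)
            return self.norm(queries + out)  # residual connection + layer norm

    # Example: fuse 20 semantic tokens with 49 visual tokens.
    semantic = torch.randn(2, 20, 512)
    visual = torch.randn(2, 49, 512)
    fused = CrossAttention()(semantic, visual)  # shape (2, 20, 512)

Stacking or interleaving such blocks over different feature types is one common way Transformer captioning models realize multi-level visual-semantic interaction; the paper's specific "cross on cross" arrangement is described in the full text.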