Cross on Cross Attention: Deep Fusion Transformer for Image Captioning
Published in | IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, No. 8, p. 1 |
Main Authors | |
Format | Journal Article |
Language | English |
Published | New York: IEEE (The Institute of Electrical and Electronics Engineers, Inc.), 01.08.2023 |
Subjects | |
ISSN | 1051-8215; 1558-2205 |
DOI | 10.1109/TCSVT.2023.3243725 |
Summary | Numerous studies have shown that in-depth mining of correlations between multi-modal features helps improve the accuracy of cross-modal data analysis tasks. However, current image captioning methods based on the encoder-decoder framework perform the interaction and fusion of multi-modal features only in the encoding stage or the decoding stage, which cannot effectively bridge the semantic gap. In this paper, we propose a Deep Fusion Transformer (DFT) for image captioning that provides a deep multi-feature and multi-modal information fusion strategy throughout the encoding-to-decoding process. We propose a novel global cross encoder to align different types of visual features, which effectively compensates for the differences between feature types and combines their complementary strengths. In the decoder, a novel cross on cross attention realizes hierarchical cross-modal data analysis, extending complex cross-modal reasoning capabilities through multi-level interaction of visual and semantic features. Extensive experiments on the MSCOCO dataset show that DFT achieves excellent performance and outperforms state-of-the-art methods. The code is available at https://github.com/weimingboya/DFT. |
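To make the "cross on cross" idea in the summary concrete, below is a minimal PyTorch sketch of one plausible hierarchical cross-attention step: caption-word queries first cross-attend to one visual stream, and the refined result is then used as the query for a second cross-attention over another stream. This is an illustration, not the authors' implementation; the class name `CrossOnCrossAttention`, the feature names `grid_feats` and `region_feats`, and all dimensions are assumptions. For the actual DFT model, see the linked repository.

```python
# A minimal sketch (NOT the authors' implementation) of stacking one
# cross attention on top of another, as suggested by the abstract.
import torch
import torch.nn as nn

class CrossOnCrossAttention(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        # Level 1: caption words attend to one type of visual features.
        self.cross1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        # Level 2: the refined queries attend to a second feature type,
        # yielding a hierarchical visual-semantic interaction.
        self.cross2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, words, grid_feats, region_feats):
        # words:        (B, T, d)  embeddings of the partial caption
        # grid_feats:   (B, N1, d) one visual feature stream (assumed)
        # region_feats: (B, N2, d) a second visual feature stream (assumed)
        h, _ = self.cross1(words, grid_feats, grid_feats)
        h = self.norm1(words + h)            # residual + norm, level 1
        out, _ = self.cross2(h, region_feats, region_feats)
        return self.norm2(h + out)           # residual + norm, level 2


# Toy usage with random tensors, just to show the shapes flow through.
if __name__ == "__main__":
    B, T, N1, N2, d = 2, 12, 49, 36, 512
    layer = CrossOnCrossAttention(d_model=d)
    y = layer(torch.randn(B, T, d), torch.randn(B, N1, d), torch.randn(B, N2, d))
    print(y.shape)  # torch.Size([2, 12, 512])
```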