Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features fro...

Bibliographic Details
Published in	IEEE Transactions on Circuits and Systems for Video Technology, Vol. 30, no. 12, pp. 4467-4480
Main Authors	Yu, Jun; Li, Jing; Yu, Zhou; Huang, Qingming
Format	Journal Article
Language	English
Published	New York: IEEE, 01.12.2020; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Ablation; Adaptation models; Artificial neural networks; Coders; Computational modeling; Convolution; Decoding; deep learning; Encoders-Decoders; Feature extraction; Hidden Markov models; Image captioning; Machine translation; multi-view learning; Neural networks; Recurrent neural networks; Task analysis; Transformers; Visualization
