Multimodal Transformer With Multi-View Visual Representation for Image Captioning

Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolutional neural network (CNN)-based image encoder that extracts region-based visual features fro...

Bibliographic Details
Published in IEEE Transactions on Circuits and Systems for Video Technology, Vol. 30, no. 12, pp. 4467-4480
Main Authors Yu, Jun; Li, Jing; Yu, Zhou; Huang, Qingming
Format Journal Article
Language English
Published New York IEEE 01.12.2020
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)