Multimodal Transformer With Multi-View Visual Representation for Image Captioning
Image captioning aims to automatically generate a natural language description of a given image, and most state-of-the-art models have adopted an encoder-decoder framework. The framework consists of a convolution neural network (CNN)-based image encoder that extracts region-based visual features fro...
Saved in:
Published in | IEEE transactions on circuits and systems for video technology Vol. 30; no. 12; pp. 4467 - 4480 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published |
New York
IEEE
01.12.2020
The Institute of Electrical and Electronics Engineers, Inc. (IEEE) |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Be the first to leave a comment!