The synergy of double attention: Combine sentence-level and word-level attention for image captioning

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 201, p. 103068
Main Authors: Wei, Haiyang; Li, Zhixin; Zhang, Canlong; Ma, Huifang
Format: Journal Article
Language: English
Published: Elsevier Inc., 01.12.2020

Summary: Existing attention models for image captioning typically extract only word-level attention information, i.e., the attention mechanism extracts local attention information from the image to generate the current word, and lacks accurate guidance from global image information. In this paper, we first propose an image captioning approach based on self-attention: sentence-level attention information is extracted from the image through a self-attention mechanism to represent the global image information needed to generate sentences. Furthermore, we propose a double attention model that combines the sentence-level attention model with the word-level attention model to generate more accurate captions. We apply supervision and optimization in the intermediate stage of the model to resolve information interference problems. In addition, we perform two-stage training with reinforcement learning to optimize the evaluation metric of the model. Finally, we evaluate our model on three standard datasets, i.e., Flickr8k, Flickr30k and MSCOCO. Experimental results show that our double attention model generates more accurate and richer captions, and outperforms many state-of-the-art image captioning approaches on various evaluation metrics.
• We apply the self-attention mechanism to the task of image captioning.
• We construct a double attention model with sentence-level and word-level attention.
• We perform two-stage training with reinforcement learning.
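The summary only sketches the architecture, so the following is a minimal PyTorch sketch of how a double-attention decoding step could combine the two levels: self-attention pools region features into one sentence-level global vector, while word-level attention focuses on local regions at each word step. All class and parameter names (WordLevelAttention, SentenceLevelAttention, DoubleAttentionStep, feat_dim, etc.) are hypothetical illustrations under assumed dimensions, not the authors' implementation; the intermediate supervision and the reinforcement-learning training stage described in the summary are omitted.

```python
import torch
import torch.nn as nn

class WordLevelAttention(nn.Module):
    """Additive attention over image region features, conditioned on the
    decoder hidden state: the local, per-word focus described in the paper."""
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, attn_dim)
        self.hidden_proj = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, regions, hidden):
        # regions: (B, R, feat_dim), hidden: (B, hidden_dim)
        e = self.score(torch.tanh(
            self.feat_proj(regions) + self.hidden_proj(hidden).unsqueeze(1)
        )).squeeze(-1)                                   # (B, R) attention scores
        alpha = torch.softmax(e, dim=-1)                 # word-level weights
        return (alpha.unsqueeze(-1) * regions).sum(dim=1)  # (B, feat_dim)

class SentenceLevelAttention(nn.Module):
    """Self-attention over region features, mean-pooled into a single global
    vector that summarizes the whole image for sentence generation."""
    def __init__(self, feat_dim, num_heads=8):
        super().__init__()
        # feat_dim must be divisible by num_heads
        self.self_attn = nn.MultiheadAttention(feat_dim, num_heads, batch_first=True)

    def forward(self, regions):
        attended, _ = self.self_attn(regions, regions, regions)  # (B, R, feat_dim)
        return attended.mean(dim=1)                      # global vector (B, feat_dim)

class DoubleAttentionStep(nn.Module):
    """One decoding step that fuses the global (sentence-level) and local
    (word-level) context vectors before predicting the next word."""
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size, attn_dim=512):
        super().__init__()
        self.word_attn = WordLevelAttention(feat_dim, hidden_dim, attn_dim)
        self.sent_attn = SentenceLevelAttention(feat_dim)
        self.lstm = nn.LSTMCell(embed_dim + 2 * feat_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, word_emb, regions, state):
        h, c = state
        global_ctx = self.sent_attn(regions)             # sentence-level guidance
        local_ctx = self.word_attn(regions, h)           # word-level focus
        h, c = self.lstm(torch.cat([word_emb, global_ctx, local_ctx], dim=-1), (h, c))
        return self.out(h), (h, c)

# Example usage with assumed sizes (e.g., 36 detected regions per image):
step = DoubleAttentionStep(feat_dim=512, embed_dim=300, hidden_dim=512, vocab_size=10000)
regions = torch.randn(4, 36, 512)                        # batch of 4 images
state = (torch.zeros(4, 512), torch.zeros(4, 512))
logits, state = step(torch.randn(4, 300), regions, state)  # one decoding step
```

Feeding both context vectors into the LSTM input is just one plausible fusion choice; the paper's actual fusion and its intermediate-stage supervision may differ.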
ISSN: 1077-3142, 1090-235X
DOI: 10.1016/j.cviu.2020.103068