A neural image captioning model with caption-to-images semantic reconstructor
| Published in | Neurocomputing (Amsterdam), Vol. 367, pp. 144–151 |
|---|---|
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | Elsevier B.V, 20.11.2019 |
Summary: Current dominant image captioning models are mostly built on a CNN-LSTM encoder-decoder framework. Although this architecture has achieved remarkable progress, it still fails to fully exploit the encoded image information: the model relies only on the image-to-caption dependency during caption generation. In this paper, we extend the conventional CNN-LSTM image captioning model by introducing a caption-to-images semantic reconstructor, which reconstructs the semantic representations of the input image and its similar images from the hidden states of the decoder. Serving as an auxiliary objective that evaluates the fidelity of the generated caption, the reconstruction score of the semantic reconstructor is combined with the likelihood to refine model training. In this way, the semantics of the input image can be transferred to the decoder more effectively and fully exploited to generate better captions. Moreover, during testing, the reconstruction score can be used along with the log-likelihood to select a better caption via reranking. Experimental results show that the proposed model significantly improves the quality of the generated captions and outperforms a conventional image captioning model, LSTM-A5.
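The test-time reranking described in the summary can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the tuple layout of the candidate list, and the linear weighting `recon_weight` are all assumptions; the paper's exact combination of log-likelihood and reconstruction score may differ.

```python
def rerank_captions(candidates, recon_weight=0.5):
    """Pick the caption maximizing log-likelihood + weighted reconstruction score.

    candidates: list of (caption, log_likelihood, reconstruction_score) tuples,
    where reconstruction_score (higher is better) measures how well the decoder
    hidden states reconstruct the semantics of the input image.
    The linear combination with recon_weight is an illustrative assumption.
    """
    def combined(cand):
        _caption, log_lik, recon = cand
        return log_lik + recon_weight * recon
    # max over the combined score implements the reranking step
    return max(candidates, key=combined)[0]

# Example: the second beam has a slightly lower likelihood but reconstructs
# the image semantics much better, so it wins after reranking.
beams = [
    ("a man riding a horse", -4.2, 0.10),
    ("a man riding a brown horse on a beach", -4.5, 0.90),
]
print(rerank_captions(beams))  # -> a man riding a brown horse on a beach
```

With `recon_weight=0`, the function degenerates to ordinary likelihood-based selection, which makes the contribution of the reconstruction score easy to isolate.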
ISSN: 0925-2312, 1872-8286
DOI: 10.1016/j.neucom.2019.08.012