LSTM-in-LSTM for generating long descriptions of images

Bibliographic Details
Published in: Computational Visual Media (Beijing), Vol. 2, No. 4, pp. 379-388
Main Authors: Song, Jun; Tang, Siliang; Xiao, Jun; Wu, Fei; Zhang, Zhongfei (Mark)
Format: Journal Article
Language: English
Published: Beijing: Tsinghua University Press; Springer Nature B.V., 01.12.2016
Summary: In this paper, we propose an approach for generating rich fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interaction between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM generally captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence of sentences and images). This architecture is capable of producing a long description by predicting one word at every time step, conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) when used to generate long, rich, fine-grained descriptions of given images, in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR).
ISSN: 2096-0433, 2096-0662
DOI: 10.1007/s41095-016-0059-z
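Illustrative sketch (not part of the published record): the code below is a minimal PyTorch approximation of the decoding scheme described in the summary above, in which an inner LSTM summarizes fine-grained visual cues into a context vector and an outer LSTM predicts one word per time step from the previously generated word plus that context. All module names, dimensions, and the teacher-forced loop are assumptions for illustration, not the authors' exact implementation.

    # Minimal sketch (assumed structure, not the paper's exact model) of an
    # LSTM-in-LSTM captioning decoder: an inner LSTM encodes visual cues into
    # a context vector; an outer LSTM predicts one word per time step.
    import torch
    import torch.nn as nn


    class LSTMinLSTMSketch(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, visual_dim=512, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Inner LSTM: encodes the sequence of fine-grained visual cues
            # (e.g., features of spatially concurrent object regions) into a
            # context vector summarizing their implicit interactions.
            self.inner_lstm = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
            # Outer LSTM: at each step, consumes the previous word's embedding
            # together with the visual context and updates the hidden state
            # used to predict the next word.
            self.outer_lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
            self.classifier = nn.Linear(hidden_dim, vocab_size)

        def forward(self, region_feats, captions):
            # region_feats: (batch, num_regions, visual_dim) visual-cue features
            # captions:     (batch, seq_len) token ids, teacher-forced
            batch_size = captions.size(0)
            _, (h_inner, _) = self.inner_lstm(region_feats)
            context = h_inner[-1]                      # (batch, hidden_dim)

            h = region_feats.new_zeros(batch_size, self.outer_lstm.hidden_size)
            c = torch.zeros_like(h)
            logits = []
            for t in range(captions.size(1)):
                word_emb = self.embed(captions[:, t])  # previous word at step t
                h, c = self.outer_lstm(torch.cat([word_emb, context], dim=1), (h, c))
                logits.append(self.classifier(h))      # one word predicted per step
            return torch.stack(logits, dim=1)          # (batch, seq_len, vocab_size)

In this sketch, region_feats could hold CNN features of detected object regions and captions the ground-truth token ids during training; at inference time the previously predicted word would be fed back in place of captions[:, t].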