LSTM-in-LSTM for generating long descriptions of images

Bibliographic Details
Published in: Computational Visual Media (Beijing), Vol. 2, No. 4, pp. 379-388
Main Authors: Song, Jun; Tang, Siliang; Xiao, Jun; Wu, Fei; Zhang, Zhongfei (Mark)
Format: Journal Article
Language: English
Published: Beijing: Tsinghua University Press; Springer Nature B.V., 01.12.2016
Summary: In this paper, we propose an approach for generating rich fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interaction between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM generally captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence of sentences and images). This architecture is capable of producing a long description by predicting one word at every time step, conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) when used to generate long, rich, fine-grained descriptions of given images, in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR).
ISSN: 2096-0433, 2096-0662
DOI: 10.1007/s41095-016-0059-z
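Illustrative sketch (not part of the published record): the code below is a minimal PyTorch approximation of the decoding scheme described in the summary above, in which an inner LSTM summarizes fine-grained visual cues into a context vector and an outer LSTM predicts one word per time step from the previously generated word plus that context. All module names, dimensions, and the teacher-forced loop are assumptions for illustration, not the authors' exact implementation.

    # Minimal sketch (assumed structure, not the paper's exact model) of an
    # LSTM-in-LSTM captioning decoder: an inner LSTM encodes visual cues into
    # a context vector; an outer LSTM predicts one word per time step.
    import torch
    import torch.nn as nn


    class LSTMinLSTMSketch(nn.Module):
        def __init__(self, vocab_size, embed_dim=256, visual_dim=512, hidden_dim=512):
            super().__init__()
            self.embed = nn.Embedding(vocab_size, embed_dim)
            # Inner LSTM: encodes the sequence of fine-grained visual cues
            # (e.g., features of spatially concurrent object regions) into a
            # context vector summarizing their implicit interactions.
            self.inner_lstm = nn.LSTM(visual_dim, hidden_dim, batch_first=True)
            # Outer LSTM: at each step, consumes the previous word's embedding
            # together with the visual context and updates the hidden state
            # used to predict the next word.
            self.outer_lstm = nn.LSTMCell(embed_dim + hidden_dim, hidden_dim)
            self.classifier = nn.Linear(hidden_dim, vocab_size)

        def forward(self, region_feats, captions):
            # region_feats: (batch, num_regions, visual_dim) visual-cue features
            # captions:     (batch, seq_len) token ids, teacher-forced
            batch_size = captions.size(0)
            _, (h_inner, _) = self.inner_lstm(region_feats)
            context = h_inner[-1]                      # (batch, hidden_dim)

            h = region_feats.new_zeros(batch_size, self.outer_lstm.hidden_size)
            c = torch.zeros_like(h)
            logits = []
            for t in range(captions.size(1)):
                word_emb = self.embed(captions[:, t])  # previous word at step t
                h, c = self.outer_lstm(torch.cat([word_emb, context], dim=1), (h, c))
                logits.append(self.classifier(h))      # one word predicted per step
            return torch.stack(logits, dim=1)          # (batch, seq_len, vocab_size)

In this sketch, region_feats could hold CNN features of detected object regions and captions the ground-truth token ids during training; at inference time the previously predicted word would be fed back in place of captions[:, t].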