Clothes image caption generation with attribute detection and visual attention model

Bibliographic Details
Published in: Pattern Recognition Letters, Vol. 141, pp. 68–74
Main Authors: Li, Xianrui; Ye, Zhiling; Zhang, Zhao; Zhao, Mingbo
Format: Journal Article
Language: English
Published: Amsterdam, Elsevier B.V. (Elsevier Science Ltd), 01.01.2021

More Information
Summary:
•An end-to-end framework for clothes image captioning is developed based on attribute detection and visual attention.
•The attribute detection module is developed to provide an encoded clothes feature carrying more attribute information.
•An attention gate is adopted to better characterize the similarity between image regions and the captioning description.
•Extensive simulations based on real-world data are conducted to verify the effectiveness of the proposed method.
Fashion is a multi-billion-dollar industry with direct social, cultural, and economic implications in the real world. While computer vision has demonstrated remarkable success in fashion-domain applications, natural language processing has also contributed to the area, building the connection between clothes images and human semantic understanding. An elementary task in combining images and language understanding is generating a natural language sentence that accurately summarizes the contents of a clothes image. In this paper, we develop a joint attribute detection and visual attention framework for clothes image captioning. Specifically, in order to involve more clothing attributes in learning, we first utilize a pre-trained Convolutional Neural Network (CNN) to learn a feature that characterizes more information about clothing attributes. Based on this learned feature, we adopt an encoder/decoder framework: we first encode the clothes feature and then input it to a language Long Short-Term Memory (LSTM) model for decoding the clothes descriptions. The method greatly enhances the performance of clothes image captioning and reduces misleading attention. Extensive simulations based on real-world data verify the effectiveness of the proposed method.
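The attention-gated decoding step described in the summary can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not the paper's exact formulation: `regions` stands in for CNN region features, `hidden` for the LSTM decoder state, and the weight matrices (`W_r`, `W_h`, `w_a`, `w_g`) are hypothetical parameters introduced here for the example. A soft-attention distribution over image regions is computed, and a sigmoid gate scales the resulting visual context before it reaches the decoder.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def gated_attention(regions, hidden, W_r, W_h, w_a, w_g):
    """One gated visual-attention step (illustrative sketch).

    regions: (k, d) array of CNN region features
    hidden:  (h,)   decoder LSTM hidden state
    W_r: (d, m), W_h: (h, m), w_a: (m,), w_g: (h,) are example parameters.
    Returns the gated context vector (d,) and the attention weights (k,).
    """
    # score each image region against the current decoder state
    scores = np.tanh(regions @ W_r + hidden @ W_h) @ w_a   # (k,)
    alpha = softmax(scores)                                # attention distribution
    context = alpha @ regions                              # weighted sum of region features
    # scalar sigmoid gate modulates how much visual context is passed on,
    # which can suppress misleading attention when the next word
    # needs little visual evidence
    gate = 1.0 / (1.0 + np.exp(-(hidden @ w_g)))
    return gate * context, alpha
```

In a full captioning loop, the gated context would be concatenated with the word embedding at each LSTM step; the gate lets the model rely less on the image for function words and more for attribute words.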
ISSN: 0167-8655; 1872-7344
DOI: 10.1016/j.patrec.2020.12.001