Image Captioning Using Motion-CNN with Object Detection

Bibliographic Details
Published in: Sensors (Basel, Switzerland), Vol. 21, No. 4, p. 1270
Main Authors: Iwamura, Kiyohiko; Louhi Kasahara, Jun Younes; Moro, Alessandro; Yamashita, Atsushi; Asama, Hajime
Format: Journal Article
Language: English
Published: MDPI AG, Switzerland, 10.02.2021

More Information
Summary: Automatic image captioning has many important applications, such as describing visual content for visually impaired people or indexing images on the internet. Recently, deep learning-based image captioning models have been researched extensively. To generate captions, they learn the relation between image features and the words included in the captions. However, image features might not be relevant for certain words, such as verbs. Our earlier reported method therefore used motion features alongside image features to generate captions that include verbs. However, it used all of the motion features; because not all of them contributed positively to the captioning process, the unnecessary ones decreased captioning accuracy. Here, we analyze through experiments with motion features why this decline in accuracy occurs, and we propose a novel, end-to-end trainable image caption generation method that alleviates the decrease. Our proposed model was evaluated using three datasets: MSR-VTT2016-Image, MSCOCO, and several copyright-free images. The results demonstrate that our proposed method improves caption generation performance.
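The abstract describes fusing motion features with still-image features so the caption decoder can ground verbs. As a rough illustration of that idea (not the authors' implementation), the PyTorch sketch below concatenates a projected image-feature vector with a projected motion-feature vector and feeds the fused vector to an LSTM caption decoder; all module names, feature dimensions, and the concatenation-based fusion are assumptions made for this example.

```python
# Minimal sketch of image + motion feature fusion for captioning.
# Feature sizes and the fusion scheme are illustrative assumptions.
import torch
import torch.nn as nn

class FusedCaptioner(nn.Module):
    def __init__(self, image_dim=2048, motion_dim=1024, embed_dim=512,
                 hidden_dim=512, vocab_size=10000):
        super().__init__()
        # Project each feature stream to a common size, then fuse by concatenation.
        self.image_proj = nn.Linear(image_dim, embed_dim)
        self.motion_proj = nn.Linear(motion_dim, embed_dim)
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_feat, motion_feat, captions):
        # image_feat: (B, image_dim), motion_feat: (B, motion_dim),
        # captions: (B, T) token ids.
        fused = self.fuse(torch.cat(
            [self.image_proj(image_feat), self.motion_proj(motion_feat)], dim=1))
        # Prepend the fused visual vector as the first step of the decoder input.
        tokens = self.embed(captions)                       # (B, T, embed_dim)
        seq = torch.cat([fused.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(seq)                          # (B, T+1, hidden_dim)
        return self.out(hidden)                             # vocabulary logits

# Usage with random tensors standing in for CNN / motion-CNN outputs.
model = FusedCaptioner()
logits = model(torch.randn(4, 2048), torch.randn(4, 1024),
               torch.randint(0, 10000, (4, 12)))
print(logits.shape)  # torch.Size([4, 13, 10000])
```

Per the abstract, the paper's refinement is selecting useful motion features rather than using all of them; a gating or masking layer in place of the plain concatenation above would be one way to express that, but the record does not specify the exact mechanism.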
ISSN: 1424-8220
DOI: 10.3390/s21041270