Generative adversarial network for semi-supervised image captioning

Bibliographic Details
Published in: Computer vision and image understanding, Vol. 249, p. 104199
Main Authors: Liang, Xu; Li, Chen; Tian, Lihua
Format: Journal Article
Language: English
Published: Elsevier Inc, 01.12.2024
ISSN: 1077-3142
DOI: 10.1016/j.cviu.2024.104199

Summary: Traditional supervised image captioning methods usually rely on a large number of images and paired captions for training. However, creating such datasets requires considerable time and human effort. We therefore propose a new semi-supervised image captioning algorithm to address this problem. The proposed method uses a generative adversarial network to generate images that match captions, and uses these generated image–caption pairs as new training data. This avoids the error-accumulation problem that arises when pseudo captions are generated autoregressively, and it allows the network to be trained directly through backpropagation. To ensure correlation between the generated images and the captions, we introduce the CLIP model as a constraint: having been pre-trained on a large amount of image–text data, CLIP excels at semantically aligning images and text. To verify the effectiveness of our method, we evaluate on the MSCOCO offline "Karpathy" test split. Experimental results show that our method significantly improves model performance when using only 1% paired data, raising the CIDEr score from 69.5% to 77.7%. This demonstrates that our method can effectively exploit unlabeled data for image captioning tasks.
•The proposed method generates images for captions instead of captions for images.
•CLIP is used to constrain the generator so that generated images match their captions.
•Parameter updates can be completed through backpropagation.
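The CLIP constraint described in the summary can be understood as an alignment loss: the cosine similarity between CLIP's image embedding of a generated image and its text embedding of the caption is pushed toward 1. The sketch below is illustrative only and is not taken from the paper; it uses random vectors as stand-ins for CLIP encoder outputs, and the function names and embedding dimension (512) are assumptions.

```python
import numpy as np

def cosine_similarity(a, b):
    """Row-wise cosine similarity between two batches of embeddings."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return (a * b).sum(axis=-1)

def clip_alignment_loss(image_emb, text_emb):
    """Loss is near 0 when image and caption embeddings agree, near 1 when unrelated."""
    return 1.0 - cosine_similarity(image_emb, text_emb)

# Mock embeddings standing in for CLIP image/text encoder outputs.
rng = np.random.default_rng(0)
text_emb = rng.normal(size=(4, 512))
aligned_img = text_emb + 0.05 * rng.normal(size=(4, 512))  # image matches caption
mismatched_img = rng.normal(size=(4, 512))                 # unrelated image

aligned_loss = clip_alignment_loss(aligned_img, text_emb).mean()
mismatched_loss = clip_alignment_loss(mismatched_img, text_emb).mean()
```

In the paper's setting this loss would be backpropagated through the generator, steering it to produce images that CLIP judges semantically consistent with the input caption; here the embeddings are fixed random vectors, so only the loss behavior is shown.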