Generative adversarial network for semi-supervised image captioning

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 249, Article 104199
Main Authors: Liang, Xu (liangxu@stu.xjtu.edu.cn, ORCID 0009-0008-0045-0032); Li, Chen (cclidd@xjtu.edu.cn); Tian, Lihua (lhtian@xjtu.edu.cn)
Format: Journal Article
Language: English
Published: Elsevier Inc., December 2024
Subjects: Generative adversarial network; Image captioning; Semi-supervised; CLIP; Transformer
ISSN: 1077-3142
DOI: 10.1016/j.cviu.2024.104199
Abstract
Traditional supervised image captioning methods usually rely on a large number of images paired with captions for training. However, creating such datasets demands considerable time and human effort. We therefore propose a new semi-supervised image captioning algorithm to address this problem. The proposed method uses a generative adversarial network to generate images that match captions, and uses these generated image–caption pairs as new training data. This avoids the error-accumulation problem that arises when pseudo captions are generated autoregressively, and it allows the network to be trained directly by backpropagation. At the same time, to ensure correlation between the generated images and their captions, we introduce the CLIP model as a constraint. Because CLIP has been pre-trained on a large amount of image–text data, it performs well at semantically aligning images and text. To verify the effectiveness of our method, we evaluate it on the MSCOCO offline "Karpathy" test split. Experimental results show that our method significantly improves model performance when only 1% of the paired data is used, raising the CIDEr score from 69.5% to 77.7%. This demonstrates that our method can effectively exploit unlabeled data for image captioning tasks.

Highlights:
• The proposed method generates images for captions instead of captions for images.
• CLIP is used to constrain the generator so that generated images match their captions.
• Parameter updates can be completed directly through backpropagation.
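The CLIP constraint described in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical example assuming the HuggingFace `transformers` CLIP API; the generator `G`, the adversarial loss, and the weight `lambda_clip` are placeholders, not the paper's exact formulation.

```python
# Minimal sketch of a frozen-CLIP alignment loss for a text-to-image generator.
# Assumes the HuggingFace `transformers` CLIP API; not the authors' code.
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()
for p in clip.parameters():  # CLIP stays frozen; only the generator is trained
    p.requires_grad_(False)

def clip_alignment_loss(pixel_values, captions):
    """Mean (1 - cosine similarity) between images and their captions.

    pixel_values: generated images already resized/normalized to CLIP's
    expected input (224x224, CLIP mean/std), so gradients flow to the generator.
    """
    tokens = tokenizer(captions, padding=True, return_tensors="pt")
    img = F.normalize(clip.get_image_features(pixel_values=pixel_values), dim=-1)
    txt = F.normalize(clip.get_text_features(**tokens), dim=-1)
    return (1.0 - (img * txt).sum(dim=-1)).mean()

# Hypothetical generator update: the CLIP term keeps generated images
# semantically matched to their captions; the GAN loss keeps them realistic.
#   fake = G(noise, caption_embedding)                 # placeholder generator
#   loss_G = adversarial_loss(fake) \
#            + lambda_clip * clip_alignment_loss(fake, captions)
#   loss_G.backward()  # direct backpropagation; no autoregressive sampling
```

Because the images come from a differentiable generator rather than from autoregressive sampling of pseudo captions, the whole objective, including the CLIP term, can be optimized by ordinary backpropagation, which is the property the abstract highlights.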