Generative adversarial network for semi-supervised image captioning

Bibliographic Details
Published in: Computer Vision and Image Understanding, Vol. 249, Article 104199
Main Authors: Liang, Xu (liangxu@stu.xjtu.edu.cn, ORCID 0009-0008-0045-0032); Li, Chen (cclidd@xjtu.edu.cn); Tian, Lihua (lhtian@xjtu.edu.cn)
Format: Journal Article
Language: English
Published: Elsevier Inc., December 2024
Subjects: Generative adversarial network; Image captioning; Semi-supervised; CLIP; Transformer
ISSN: 1077-3142
DOI: 10.1016/j.cviu.2024.104199
Abstract
Traditional supervised image captioning methods usually rely on a large number of images paired with captions for training. However, creating such datasets demands considerable time and human effort. We therefore propose a new semi-supervised image captioning algorithm to address this problem. The proposed method uses a generative adversarial network to generate images that match captions, and uses these generated image–caption pairs as new training data. This avoids the error-accumulation problem that arises when pseudo captions are generated autoregressively, and it allows the network to be trained directly by backpropagation. At the same time, to ensure correlation between the generated images and their captions, we introduce the CLIP model as a constraint. Because CLIP has been pre-trained on a large amount of image–text data, it performs well at semantically aligning images and text. To verify the effectiveness of our method, we evaluate it on the MSCOCO offline "Karpathy" test split. Experimental results show that our method significantly improves model performance when only 1% of the paired data is used, raising the CIDEr score from 69.5% to 77.7%. This demonstrates that our method can effectively exploit unlabeled data for image captioning tasks.

Highlights:
• The proposed method generates images for captions instead of captions for images.
• CLIP is used to constrain the generator so that generated images match their captions.
• Parameter updates can be completed directly through backpropagation.
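The CLIP constraint described in the abstract can be illustrated with a short sketch. This is a minimal, hypothetical example assuming the HuggingFace `transformers` CLIP API; the generator `G`, the adversarial loss, and the weight `lambda_clip` are placeholders, not the paper's exact formulation.

```python
# Minimal sketch of a frozen-CLIP alignment loss for a text-to-image generator.
# Assumes the HuggingFace `transformers` CLIP API; not the authors' code.
import torch.nn.functional as F
from transformers import CLIPModel, CLIPTokenizer

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
clip.eval()
for p in clip.parameters():  # CLIP stays frozen; only the generator is trained
    p.requires_grad_(False)

def clip_alignment_loss(pixel_values, captions):
    """Mean (1 - cosine similarity) between images and their captions.

    pixel_values: generated images already resized/normalized to CLIP's
    expected input (224x224, CLIP mean/std), so gradients flow to the generator.
    """
    tokens = tokenizer(captions, padding=True, return_tensors="pt")
    img = F.normalize(clip.get_image_features(pixel_values=pixel_values), dim=-1)
    txt = F.normalize(clip.get_text_features(**tokens), dim=-1)
    return (1.0 - (img * txt).sum(dim=-1)).mean()

# Hypothetical generator update: the CLIP term keeps generated images
# semantically matched to their captions; the GAN loss keeps them realistic.
#   fake = G(noise, caption_embedding)                 # placeholder generator
#   loss_G = adversarial_loss(fake) \
#            + lambda_clip * clip_alignment_loss(fake, captions)
#   loss_G.backward()  # direct backpropagation; no autoregressive sampling
```

Because the images come from a differentiable generator rather than from autoregressive sampling of pseudo captions, the whole objective, including the CLIP term, can be optimized by ordinary backpropagation, which is the property the abstract highlights.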