From methods to datasets: A survey on Image-Caption Generators

Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 83, No. 9, pp. 28077-28123
Main Authors: Agarwal, Lakshita; Verma, Bindu
Format: Journal Article
Language: English
Published: New York: Springer US, 01.03.2024 (Springer Nature B.V.)

Abstract
Image-caption generation is a popular artificial intelligence research area that combines image understanding with language generation. Producing well-structured sentences requires a thorough syntactic and semantic understanding of language. Describing the content of an image in well-formed phrases is a difficult task, but one with significant impact, for example in helping visually impaired people better understand image content. Image captioning has therefore attracted considerable attention across computer vision and natural language processing (NLP) applications. The goal is to generate coherent and accurate natural language sentences that describe an image, which requires the captioning model to detect objects and correctly characterise their relationships. It remains difficult for a machine to perceive a typical image the way humans do, yet the task provides a foundation for intelligent exploration in deep learning. In this review paper, we focus on the latest advanced techniques for image captioning: we survey the related methodologies, highlight the aspects crucial to computer-based recognition, and cover the many strategies and procedures developed for generating image captions. We observe that recurrent neural networks (RNNs) are used in the bulk of the surveyed works (45%), followed by attention-based models (30%), transformer-based models (15%), and other methods (10%). The paper gives an overview of the approaches used in image captioning research, examines their benefits and drawbacks, and reviews the most commonly used datasets and evaluation procedures in this field.
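
To make the dominant approach concrete, the sketch below shows the classic CNN-encoder / LSTM-decoder captioning pattern, i.e. the RNN-based family that the abstract reports in roughly 45% of surveyed works. It is a minimal PyTorch illustration, not the method of any particular paper; the backbone choice (ResNet-18), the single-layer LSTM, and all names and dimensions (vocab_size, embed_dim, hidden_dim) are assumptions made for the example.

# Minimal sketch of the CNN-encoder / LSTM-decoder captioning pattern.
# All dimensions and names are illustrative assumptions, not the survey's method.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a CNN backbone."""
    def __init__(self, embed_dim: int):
        super().__init__()
        resnet = models.resnet18(weights=None)  # untrained backbone keeps the sketch offline
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:  # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)               # (B, 512)
        return self.fc(feats)                                  # (B, embed_dim)

class DecoderRNN(nn.Module):
    """Generate a caption token-by-token, conditioned on the image feature."""
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: the image feature acts as the first "token", then the
        # ground-truth caption tokens are fed in one step at a time.
        tokens = self.embed(captions)                          # (B, T, embed_dim)
        inputs = torch.cat([img_feat.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)                          # (B, T+1, hidden_dim)
        return self.out(hidden)                                # (B, T+1, vocab_size) logits

# Toy forward pass; a real system would train with cross-entropy against the
# next-token targets and decode greedily or with beam search.
encoder = EncoderCNN(embed_dim=256)
decoder = DecoderRNN(vocab_size=1000, embed_dim=256, hidden_dim=512)
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 1000, (4, 12))
logits = decoder(encoder(images), captions)                    # shape: (4, 13, 1000)

The attention- and transformer-based models the survey also covers differ mainly in replacing the single image vector with a grid or set of region features that the decoder attends to at every generation step.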

Authors
– Lakshita Agarwal, Department of Information Technology, Delhi Technological University
– Bindu Verma (bindu.cvision@gmail.com; ORCID: 0000-0003-3534-3364), Department of Information Technology, Delhi Technological University
Copyright: The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
DOI: 10.1007/s11042-023-16560-x
EISSN: 1573-7721
ISSN: 1380-7501
Keywords: Image-Caption Generator; Deep learning; Computer vision; Natural language processing; Intelligent exploration
Subjects: Artificial intelligence; Computer Communication Networks; Computer Science; Computer vision; Data Structures and Information Theory; Machine learning; Multimedia Information Systems; Natural language; Natural language processing; Recurrent neural networks; Special Purpose and Application-Based Systems; Track 6: Computer Vision for Multimedia Applications
Online Access:
https://link.springer.com/article/10.1007/s11042-023-16560-x
https://www.proquest.com/docview/2933270290