From methods to datasets: A survey on Image-Caption Generators

Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 83, No. 9, pp. 28077-28123
Main Authors: Agarwal, Lakshita; Verma, Bindu
Format: Journal Article
Language: English
Published: New York: Springer US, 01.03.2024 (Springer Nature B.V.)

Abstract
Image-caption generation is a popular artificial intelligence research area that combines image understanding with language generation. Producing well-structured sentences requires a thorough syntactic and semantic understanding of language. Describing the content of an image in well-formed phrases is a difficult task, but one with significant impact, for example in helping visually impaired people better understand image content. Image captioning has therefore attracted considerable attention across computer vision and natural language processing (NLP) applications. The goal is to generate coherent and accurate natural language sentences that describe an image, which requires the captioning model to detect objects and correctly characterise their relationships. It remains difficult for a machine to perceive a typical image the way humans do, yet the task provides a foundation for intelligent exploration in deep learning. In this review paper, we focus on the latest advanced techniques for image captioning: we survey the related methodologies, highlight the aspects crucial to computer-based recognition, and cover the many strategies and procedures developed for generating image captions. We observe that recurrent neural networks (RNNs) are used in the bulk of the surveyed works (45%), followed by attention-based models (30%), transformer-based models (15%), and other methods (10%). The paper gives an overview of the approaches used in image captioning research, examines their benefits and drawbacks, and reviews the most commonly used datasets and evaluation procedures in this field.
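
To make the dominant approach concrete, the sketch below shows the classic CNN-encoder / LSTM-decoder captioning pattern, i.e. the RNN-based family that the abstract reports in roughly 45% of surveyed works. It is a minimal PyTorch illustration, not the method of any particular paper; the backbone choice (ResNet-18), the single-layer LSTM, and all names and dimensions (vocab_size, embed_dim, hidden_dim) are assumptions made for the example.

# Minimal sketch of the CNN-encoder / LSTM-decoder captioning pattern.
# All dimensions and names are illustrative assumptions, not the survey's method.
import torch
import torch.nn as nn
import torchvision.models as models

class EncoderCNN(nn.Module):
    """Encode an image into a fixed-length feature vector with a CNN backbone."""
    def __init__(self, embed_dim: int):
        super().__init__()
        resnet = models.resnet18(weights=None)  # untrained backbone keeps the sketch offline
        self.backbone = nn.Sequential(*list(resnet.children())[:-1])  # drop the classifier head
        self.fc = nn.Linear(resnet.fc.in_features, embed_dim)

    def forward(self, images: torch.Tensor) -> torch.Tensor:  # images: (B, 3, H, W)
        feats = self.backbone(images).flatten(1)               # (B, 512)
        return self.fc(feats)                                  # (B, embed_dim)

class DecoderRNN(nn.Module):
    """Generate a caption token-by-token, conditioned on the image feature."""
    def __init__(self, vocab_size: int, embed_dim: int, hidden_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feat: torch.Tensor, captions: torch.Tensor) -> torch.Tensor:
        # Teacher forcing: the image feature acts as the first "token", then the
        # ground-truth caption tokens are fed in one step at a time.
        tokens = self.embed(captions)                          # (B, T, embed_dim)
        inputs = torch.cat([img_feat.unsqueeze(1), tokens], dim=1)
        hidden, _ = self.lstm(inputs)                          # (B, T+1, hidden_dim)
        return self.out(hidden)                                # (B, T+1, vocab_size) logits

# Toy forward pass; a real system would train with cross-entropy against the
# next-token targets and decode greedily or with beam search.
encoder = EncoderCNN(embed_dim=256)
decoder = DecoderRNN(vocab_size=1000, embed_dim=256, hidden_dim=512)
images = torch.randn(4, 3, 224, 224)
captions = torch.randint(0, 1000, (4, 12))
logits = decoder(encoder(images), captions)                    # shape: (4, 13, 1000)

The attention- and transformer-based models the survey also covers differ mainly in replacing the single image vector with a grid or set of region features that the decoder attends to at every generation step.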

Authors
– Lakshita Agarwal, Department of Information Technology, Delhi Technological University
– Bindu Verma (bindu.cvision@gmail.com; ORCID: 0000-0003-3534-3364), Department of Information Technology, Delhi Technological University
Copyright: The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023. Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
DOI: 10.1007/s11042-023-16560-x
EISSN: 1573-7721
ISSN: 1380-7501
Keywords: Image-Caption Generator; Deep learning; Computer vision; Natural language processing; Intelligent exploration
Subjects: Artificial intelligence; Computer Communication Networks; Computer Science; Computer vision; Data Structures and Information Theory; Machine learning; Multimedia Information Systems; Natural language; Natural language processing; Recurrent neural networks; Special Purpose and Application-Based Systems; Track 6: Computer Vision for Multimedia Applications
Online Access:
https://link.springer.com/article/10.1007/s11042-023-16560-x
https://www.proquest.com/docview/2933270290