Graph-based image captioning with semantic and spatial features



Bibliographic Details
Published in: Signal Processing: Image Communication, Vol. 133, Article 117273
Main authors: Parseh, Mohammad Javad; Ghadiri, Saeed
Format: Journal Article
Language: English
Published by: Elsevier B.V., 1 April 2025

Abstract
• Objective: To enrich an image captioning model by leveraging spatial and semantic relations among objects, alongside standard visual features, to produce context-rich and accurate captions.
• Key methodology: (1) employs RelTR to extract object bounding boxes and subject-predicate-object relationships; (2) constructs spatial and semantic graphs and extracts contextual features from them using Graph Convolutional Networks; (3) an LSTM decoder incorporates the CNN visual features, the graph-based features, and the word embeddings via a multi-modal attention mechanism.
• Results: The method is competitive with state-of-the-art approaches and yields contextually aware, accurate descriptions that draw on richer contextual information.
• Impact: Enables applications in automatic captioning, scene interpretation, and assistive technology.

Image captioning is a challenging image-processing task that aims to generate descriptive, accurate textual descriptions of images. In this paper, we propose a novel image captioning framework that exploits the spatial and semantic relationships between objects in an image in addition to traditional visual features. Our approach uses the pre-trained RelTR model as a backbone for extracting object bounding boxes and subject-predicate-object relationship pairs. From these extracted relationships we construct spatial and semantic graphs, which are processed by separate Graph Convolutional Networks (GCNs) to obtain high-level contextualized features. In parallel, a CNN extracts visual features from the input image. To fuse the feature vectors, a multi-modal attention mechanism is applied separately to the image feature maps, the semantic-graph nodes, and the spatial-graph nodes at each time step of the LSTM-based decoder. The model concatenates the attended features with the word embedding at the corresponding time step and feeds the result into the LSTM cell. Our experiments demonstrate the effectiveness of the proposed approach, which competes closely with existing state-of-the-art image captioning techniques, capturing richer contextual information and generating accurate, semantically meaningful captions. © 2025 Elsevier Inc. All rights reserved.
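The pipeline the abstract describes can be illustrated with a minimal, hypothetical sketch: build a graph from subject-predicate-object triples, apply one GCN propagation step, attend over the contextualized node features, and concatenate the attended context with a word embedding as decoder input. All dimensions, the example triples, and the `attend` helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 1. Semantic graph from RelTR-style triples (node indices are objects) ---
triples = [(0, "holding", 1), (1, "on", 2)]    # e.g. person-holding-cup, cup-on-table
n_nodes, d = 3, 8
A = np.eye(n_nodes)                            # adjacency with self-loops
for s, _, o in triples:
    A[s, o] = A[o, s] = 1.0                    # one undirected edge per relation

# --- 2. One GCN layer: H' = ReLU(D^-1/2 A D^-1/2 H W) ---
H = rng.standard_normal((n_nodes, d))          # initial node features
W = rng.standard_normal((d, d))                # learnable weights (random here)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
H_ctx = np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ H @ W, 0.0)

# --- 3. Attention over node features, conditioned on the decoder state ---
def attend(feats, h):
    scores = feats @ h                         # (n_nodes,) dot-product scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # softmax weights
    return w @ feats                           # weighted sum -> context vector

h_dec = rng.standard_normal(d)                 # previous LSTM hidden state
ctx_graph = attend(H_ctx, h_dec)               # attended graph context

# --- 4. Decoder input: attended context concatenated with the word embedding ---
w_emb = rng.standard_normal(d)
x_t = np.concatenate([ctx_graph, w_emb])       # this vector feeds the LSTM cell
print(x_t.shape)                               # (16,)
```

In the full model this attention step would run separately over the image feature maps, the semantic-graph nodes, and the spatial-graph nodes at every decoding time step, with all attended vectors concatenated before the LSTM cell.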
Author details:
– Parseh, Mohammad Javad (ORCID: 0000-0003-0109-3133; email: parseh@jahromu.ac.ir)
– Ghadiri, Saeed
Copyright: © 2025 Elsevier B.V.
DOI: 10.1016/j.image.2025.117273
Disciplines: Applied Sciences; Engineering; Computer Science
ISSN 0923-5965
IsPeerReviewed true
IsScholarly true
Keywords: Spatial graph; Attention mechanism; Image captioning; Semantic graph
References Simonyan, Zisserman (bib0017) 2014
Hochreiter, Schmidhuber (bib0001) 1997; 9
Zhang (bib0033) 2021
Wei (bib0067) 2020; 201
Guo (bib0028) 2020
Kingma, Ba (bib0064) 2014
Dubey (bib0045) 2023; 623
Sasibhooshan, Kumaraswamy, Sasidharan (bib0071) 2023; 10
Kipf, Welling (bib0005) 2016
Wang (bib0061) 2019; 38
Krizhevsky, Sutskever, Hinton (bib0015) 2017; 60
Donahue (bib0056) 2015
Gao (bib0075) 2019
Shen (bib0037) 2021
Karpathy, Fei-Fei (bib0062) 2015
Herdade, S., et al., Image captioning: transforming objects into words. Advances in neural information processing systems, 2019. 32.
Chen (bib0043) 2023; 77
Young (bib0063) 2014; 2
Banerjee, Lavie (bib0010) 2005
Duan (bib0042) 2023; 228
Touvron (bib0035) 2021
He (bib0007) 2016
Cornia (bib0039) 2021
Radford, A., et al., Improving language understanding by generative pre-training. 2018.
Huang (bib0029) 2019
Yao (bib0051) 2019
Wang, Gu (bib0044) 2023; 211
Wang, Xu, Sun (bib0074) 2022
Zhong (bib0047) 2021; 78
Zhang, Li, Wang (bib0060) 2021; 75
Li (bib0040) 2020
Özçelik, Altan (bib0003) 2023
Cong, Yang, Rosenhahn (bib0006) 2023
Wei (bib0068) 2021; 17
Hu (bib0046) 2023; 519
Zhang (bib0041) 2023
Yağ, Altan (bib0004) 2022; 11
Anderson (bib0013) 2016
Mao (bib0016) 2014
Anderson (bib0023) 2018
Ge (bib0054) 2019
Papineni (bib0009) 2002
Dosovitskiy (bib0034) 2020
Chen (bib0022) 2017
Liu (bib0036) 2021
ROUGE (bib0011) 2004
Lu (bib0021) 2017
Yang (bib0025) 2019
Hu (bib0070) 2022; 128
Zhou (bib0059) 2020
Jiang (bib0069) 2022; 31
Guo (bib0049) 2019
Bahdanau, Cho, Bengio (bib0019) 2014
Devlin (bib0058) 2018
Vinyals (bib0014) 2015
Li (bib0026) 2019
Jiang (bib0032) 2020
Moral (bib0065) 2022
Xu (bib0020) 2015
Lin (bib0008) 2014
Xiao (bib0052) 2022
Abedi, Karshenas, Adibi (bib0066) 2023
Shi (bib0050) 2020
Vaswani (bib0024) 2017
Yao (bib0048) 2018
Vedantam, Zitnick, Parikh (bib0012) 2015
Özçelik, Altan (bib0002) 2023; 7
Cornia (bib0031) 2020
Zhang (bib0055) 2022
Yang, Liu, Wang (bib0073) 2022
Rennie (bib0018) 2017
Mokady, Hertz, Bermano (bib0038) 2021
Ji (bib0072) 2021
Pan (bib0030) 2020
Li (bib0053) 2023; 129
Li (10.1016/j.image.2025.117273_bib0053) 2023; 129
Sasibhooshan (10.1016/j.image.2025.117273_bib0071) 2023; 10
Papineni (10.1016/j.image.2025.117273_bib0009) 2002
Krizhevsky (10.1016/j.image.2025.117273_bib0015) 2017; 60
Donahue (10.1016/j.image.2025.117273_bib0056) 2015
Wei (10.1016/j.image.2025.117273_bib0068) 2021; 17
Gao (10.1016/j.image.2025.117273_bib0075) 2019
Özçelik (10.1016/j.image.2025.117273_bib0002) 2023; 7
Cong (10.1016/j.image.2025.117273_bib0006) 2023
Yang (10.1016/j.image.2025.117273_bib0025) 2019
Ge (10.1016/j.image.2025.117273_bib0054) 2019
He (10.1016/j.image.2025.117273_bib0007) 2016
Dosovitskiy (10.1016/j.image.2025.117273_bib0034) 2020
Liu (10.1016/j.image.2025.117273_bib0036) 2021
Cornia (10.1016/j.image.2025.117273_bib0031) 2020
Yağ (10.1016/j.image.2025.117273_bib0004) 2022; 11
Guo (10.1016/j.image.2025.117273_bib0049) 2019
Wang (10.1016/j.image.2025.117273_bib0044) 2023; 211
Hochreiter (10.1016/j.image.2025.117273_bib0001) 1997; 9
ROUGE (10.1016/j.image.2025.117273_bib0011) 2004
Zhang (10.1016/j.image.2025.117273_bib0041) 2023
Zhou (10.1016/j.image.2025.117273_bib0059) 2020
Xu (10.1016/j.image.2025.117273_bib0020) 2015
Shi (10.1016/j.image.2025.117273_bib0050) 2020
Vaswani (10.1016/j.image.2025.117273_bib0024) 2017
Duan (10.1016/j.image.2025.117273_bib0042) 2023; 228
Jiang (10.1016/j.image.2025.117273_bib0069) 2022; 31
Kingma (10.1016/j.image.2025.117273_bib0064) 2014
Xiao (10.1016/j.image.2025.117273_bib0052) 2022
Yao (10.1016/j.image.2025.117273_bib0051) 2019
Lin (10.1016/j.image.2025.117273_bib0008) 2014
Bahdanau (10.1016/j.image.2025.117273_bib0019) 2014
Özçelik (10.1016/j.image.2025.117273_bib0003) 2023
Hu (10.1016/j.image.2025.117273_bib0070) 2022; 128
Guo (10.1016/j.image.2025.117273_bib0028) 2020
Wang (10.1016/j.image.2025.117273_bib0061) 2019; 38
Lu (10.1016/j.image.2025.117273_bib0021) 2017
Kipf (10.1016/j.image.2025.117273_bib0005) 2016
Moral (10.1016/j.image.2025.117273_bib0065) 2022
Chen (10.1016/j.image.2025.117273_bib0022) 2017
Zhang (10.1016/j.image.2025.117273_bib0055) 2022
Ji (10.1016/j.image.2025.117273_bib0072) 2021
Anderson (10.1016/j.image.2025.117273_bib0013) 2016
Chen (10.1016/j.image.2025.117273_bib0043) 2023; 77
Dubey (10.1016/j.image.2025.117273_bib0045) 2023; 623
Devlin (10.1016/j.image.2025.117273_bib0058) 2018
10.1016/j.image.2025.117273_bib0057
Anderson (10.1016/j.image.2025.117273_bib0023) 2018
Pan (10.1016/j.image.2025.117273_bib0030) 2020
Wei (10.1016/j.image.2025.117273_bib0067) 2020; 201
Vedantam (10.1016/j.image.2025.117273_bib0012) 2015
Simonyan (10.1016/j.image.2025.117273_bib0017) 2014
Vinyals (10.1016/j.image.2025.117273_bib0014) 2015
Jiang (10.1016/j.image.2025.117273_bib0032) 2020
Li (10.1016/j.image.2025.117273_bib0040) 2020
Zhong (10.1016/j.image.2025.117273_bib0047) 2021; 78
Banerjee (10.1016/j.image.2025.117273_bib0010) 2005
Abedi (10.1016/j.image.2025.117273_bib0066) 2023
Yang (10.1016/j.image.2025.117273_bib0073) 2022
Young (10.1016/j.image.2025.117273_bib0063) 2014; 2
Hu (10.1016/j.image.2025.117273_bib0046) 2023; 519
Yao (10.1016/j.image.2025.117273_bib0048) 2018
Karpathy (10.1016/j.image.2025.117273_bib0062) 2015
Wang (10.1016/j.image.2025.117273_bib0074) 2022
Rennie (10.1016/j.image.2025.117273_bib0018) 2017
Li (10.1016/j.image.2025.117273_bib0026) 2019
Huang (10.1016/j.image.2025.117273_bib0029) 2019
Cornia (10.1016/j.image.2025.117273_bib0039) 2021
Zhang (10.1016/j.image.2025.117273_bib0033) 2021
Shen (10.1016/j.image.2025.117273_bib0037) 2021
Zhang (10.1016/j.image.2025.117273_bib0060) 2021; 75
Mokady (10.1016/j.image.2025.117273_bib0038) 2021
Mao (10.1016/j.image.2025.117273_bib0016) 2014
10.1016/j.image.2025.117273_bib0027
Touvron (10.1016/j.image.2025.117273_bib0035) 2021
References_xml – year: 2016
  ident: bib0007
  article-title: Deep residual learning for image recognition
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2021
  ident: bib0072
  article-title: Improving image captioning by leveraging intra-and inter-layer global representation in transformer network
  publication-title: Proceedings of the AAAI conference on artificial intelligence
– year: 2020
  ident: bib0059
  article-title: Unified vision-language pre-training for image captioning and vqa
  publication-title: Proceedings of the AAAI conference on artificial intelligence
– year: 2023
  ident: bib0003
  article-title: Classification of diabetic retinopathy by machine learning algorithm using entorpy-based features
  publication-title: Proceedings of the ÇAnkaya International Congress on Scientific Research
– year: 2002
  ident: bib0009
  article-title: Bleu: a method for automatic evaluation of machine translation
  publication-title: Proceedings of the 40th annual meeting of the Association for Computational Linguistics
– year: 2014
  ident: bib0017
  article-title: arXiv preprint
– year: 2023
  ident: bib0066
  article-title: arXiv preprint
– year: 2014
  ident: bib0019
  article-title: arXiv preprint
– year: 2020
  ident: bib0028
  article-title: Normalized and geometry-aware self-attention network for image captioning
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2021
  ident: bib0035
  article-title: Training data-efficient image transformers & distillation through attention
  publication-title: International conference on machine learning
– volume: 201
  year: 2020
  ident: bib0067
  article-title: The synergy of double attention: combine sentence-level and word-level attention for image captioning
  publication-title: Computer Vis. Image Underst.
– volume: 623
  start-page: 812
  year: 2023
  end-page: 831
  ident: bib0045
  article-title: Label-attention transformer with geometrically coherent objects for image captioning
  publication-title: Inf Sci (Ny)
– year: 2014
  ident: bib0016
  article-title: arXiv preprint
– year: 2017
  ident: bib0018
  article-title: Self-critical sequence training for image captioning
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2015
  ident: bib0062
  article-title: Deep visual-semantic alignments for generating image descriptions
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– start-page: 30
  year: 2017
  ident: bib0024
  article-title: Attention is all you need
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2020
  ident: bib0030
  article-title: X-linear attention networks for image captioning
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2005
  ident: bib0010
  article-title: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments
  publication-title: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization
– reference: Herdade, S., et al., Image captioning: transforming objects into words. Advances in neural information processing systems, 2019. 32.
– year: 2018
  ident: bib0048
  article-title: Exploring visual relationship for image captioning
  publication-title: Proceedings of the European conference on computer vision (ECCV)
– year: 2023
  ident: bib0041
  article-title: Cross on cross attention: deep fusion transformer for image captioning
  publication-title: IEEE Transactions on Circuits and Systems for Video Technology
– year: 2014
  ident: bib0008
  article-title: Microsoft coco: common objects in context
  publication-title: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13
– year: 2014
  ident: bib0064
  article-title: arXiv preprint
– volume: 519
  start-page: 69
  year: 2023
  end-page: 81
  ident: bib0046
  article-title: MAENet: a novel multi-head association attention enhancement network for completing intra-modal interaction in image captioning
  publication-title: Neurocomputing.
– volume: 11
  start-page: 1732
  year: 2022
  ident: bib0004
  article-title: Artificial intelligence-based robust hybrid algorithm design and implementation for real-time detection of plant diseases in agricultural environments
  publication-title: Biology.
– year: 2017
  ident: bib0021
  article-title: Knowing when to look: adaptive attention via a visual sentinel for image captioning
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– volume: 75
  year: 2021
  ident: bib0060
  article-title: Parallel-fusion LSTM with synchronous semantic and visual information for image captioning
  publication-title: J. Vis. Commun. Image Represent.
– year: 2021
  ident: bib0039
  article-title: arXiv preprint
– year: 2018
  ident: bib0058
  article-title: arXiv preprint
– year: 2019
  ident: bib0029
  article-title: Attention on attention for image captioning
  publication-title: Proceedings of the IEEE/CVF international conference on computer vision
– volume: 17
  start-page: 1
  year: 2021
  end-page: 22
  ident: bib0068
  article-title: Integrating scene semantic knowledge into image captioning
  publication-title: ACM Trans. Multimedia Comput. Commun. Appl. (TOMM)
– volume: 129
  year: 2023
  ident: bib0053
  article-title: Modeling graph-structured contexts for image captioning
  publication-title: Image Vis. Comput.
– volume: 31
  start-page: 3920
  year: 2022
  end-page: 3934
  ident: bib0069
  article-title: Visual cluster grounding for image captioning
  publication-title: IEEE Trans. Image Proc.
– year: 2022
  ident: bib0052
  article-title: Relational Graph Reasoning Transformer for Image Captioning
  publication-title: 2022 IEEE International Conference on Multimedia and Expo (ICME)
– year: 2015
  ident: bib0056
  article-title: Long-term recurrent convolutional networks for visual recognition and description
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2019
  ident: bib0049
  article-title: Aligning linguistic words and visual semantic units for image captioning
  publication-title: Proceedings of the 27th ACM international conference on multimedia
– year: 2020
  ident: bib0032
  article-title: In defense of grid features for visual question answering
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2019
  ident: bib0051
  article-title: Hierarchy parsing for image captioning
  publication-title: Proceedings of the IEEE/CVF international conference on computer vision
– year: 2023
  ident: bib0006
  article-title: Reltr: relation transformer for scene graph generation
  publication-title: IEEe Trans. Pattern. Anal. Mach. Intell.
– year: 2015
  ident: bib0020
  article-title: Show, attend and tell: neural image caption generation with visual attention
  publication-title: International conference on machine learning
– volume: 228
  year: 2023
  ident: bib0042
  article-title: Cross-domain multi-style merge for image captioning
  publication-title: Computer Vision and Image Understanding
– year: 2015
  ident: bib0014
  article-title: Show and tell: a neural image caption generator
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– volume: 78
  year: 2021
  ident: bib0047
  article-title: Attention-guided image captioning with adaptive global and local feature fusion
  publication-title: J. Vis. Commun. Image Represent.
– volume: 60
  start-page: 84
  year: 2017
  end-page: 90
  ident: bib0015
  article-title: Imagenet classification with deep convolutional neural networks
  publication-title: Commun ACM
– year: 2019
  ident: bib0054
  article-title: Exploring overall contextual information for image captioning in human-like cognitive style
  publication-title: Proceedings of the IEEE/CVF International Conference on Computer Vision
– year: 2022
  ident: bib0065
  publication-title: Automated Image Captioning with Multi-layer Gated Recurrent Unit. in 2022 30th European Signal Processing Conference (EUSIPCO)
– volume: 77
  year: 2023
  ident: bib0043
  article-title: Relational-Convergent Transformer for image captioning
  publication-title: Displays
– volume: 38
  start-page: 1
  year: 2019
  end-page: 12
  ident: bib0061
  article-title: Dynamic graph cnn for learning on point clouds
  publication-title: Acm Trans. Graphics (tog)
– year: 2022
  ident: bib0074
  article-title: End-to-end transformer based model for image captioning
  publication-title: Proceedings of the AAAI Conference on Artificial Intelligence
– year: 2017
  ident: bib0022
  article-title: Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2018
  ident: bib0023
  article-title: Bottom-up and top-down attention for image captioning and visual question answering
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2020
  ident: bib0031
  article-title: Meshed-memory transformer for image captioning
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2020
  ident: bib0034
  article-title: arXiv preprint
– year: 2020
  ident: bib0040
  article-title: Oscar: object-semantics aligned pre-training for vision-language tasks
  publication-title: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16
– volume: 9
  start-page: 1735
  year: 1997
  end-page: 1780
  ident: bib0001
  article-title: Long short-term memory
  publication-title: Neural Comput.
– reference: Radford, A., et al., Improving language understanding by generative pre-training. 2018.
– volume: 7
  start-page: 598
  year: 2023
  ident: bib0002
  article-title: Overcoming nonlinear dynamics in diabetic retinopathy classification: a robust AI-based model with chaotic swarm intelligence optimization and recurrent long short-term memory
  publication-title: Fract. Fraction.
– year: 2022
  ident: bib0055
  article-title: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s
– year: 2021
  ident: bib0036
  article-title: arXiv preprint
– year: 2019
  ident: bib0026
  article-title: Entangled transformer for image captioning
  publication-title: Proceedings of the IEEE/CVF international conference on computer vision
– year: 2021
  ident: bib0037
  article-title: arXiv preprint
– year: 2019
  ident: bib0025
  article-title: Auto-encoding scene graphs for image captioning
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– volume: 2
  start-page: 67
  year: 2014
  end-page: 78
  ident: bib0063
  article-title: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions
  publication-title: Trans. Assoc. Comput. Linguist.
– year: 2004
  ident: bib0011
  article-title: A package for automatic evaluation of summaries
  publication-title: Proceedings of Workshop on Text Summarization of ACL
– year: 2015
  ident: bib0012
  article-title: Cider: consensus-based image description evaluation
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– volume: 10
  start-page: 18
  year: 2023
  ident: bib0071
  article-title: Image caption generation using visual attention prediction and contextual spatial relation extraction
  publication-title: J. Big. Data
– year: 2019
  ident: bib0075
  article-title: Deliberate attention networks for image captioning
  publication-title: Proceedings of the AAAI conference on artificial intelligence
– year: 2016
  ident: bib0005
  article-title: arXiv preprint
– year: 2021
  ident: bib0038
  article-title: arXiv preprint
– volume: 211
  year: 2023
  ident: bib0044
  article-title: Learning joint relationship attention network for image captioning
  publication-title: Expert. Syst. Appl.
– year: 2020
  ident: bib0050
  article-title: arXiv preprint
– year: 2021
  ident: bib0033
  article-title: Rstnet: captioning with adaptive attention on visual and non-visual words
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2016
  ident: bib0013
  article-title: Spice: semantic propositional image caption evaluation
  publication-title: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14
– year: 2022
  ident: bib0073
  article-title: Reformer: the relational transformer for image captioning
  publication-title: Proceedings of the 30th ACM International Conference on Multimedia
– volume: 128
  year: 2022
  ident: bib0070
  article-title: Position-guided transformer for image captioning
  publication-title: Image Vis. Comput.
– ident: 10.1016/j.image.2025.117273_bib0057
– start-page: 30
  year: 2017
  ident: 10.1016/j.image.2025.117273_bib0024
  article-title: Attention is all you need
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2023
  ident: 10.1016/j.image.2025.117273_bib0003
  article-title: Classification of diabetic retinopathy by machine learning algorithm using entorpy-based features
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0054
  article-title: Exploring overall contextual information for image captioning in human-like cognitive style
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0075
  article-title: Deliberate attention networks for image captioning
– year: 2005
  ident: 10.1016/j.image.2025.117273_bib0010
  article-title: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments
– year: 2016
  ident: 10.1016/j.image.2025.117273_bib0013
  article-title: Spice: semantic propositional image caption evaluation
– volume: 211
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0044
  article-title: Learning joint relationship attention network for image captioning
  publication-title: Expert. Syst. Appl.
  doi: 10.1016/j.eswa.2022.118474
– volume: 11
  start-page: 1732
  issue: 12
  year: 2022
  ident: 10.1016/j.image.2025.117273_bib0004
  article-title: Artificial intelligence-based robust hybrid algorithm design and implementation for real-time detection of plant diseases in agricultural environments
  publication-title: Biology.
  doi: 10.3390/biology11121732
– volume: 10
  start-page: 18
  issue: 1
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0071
  article-title: Image caption generation using visual attention prediction and contextual spatial relation extraction
  publication-title: J. Big. Data
  doi: 10.1186/s40537-023-00693-9
– volume: 75
  year: 2021
  ident: 10.1016/j.image.2025.117273_bib0060
  article-title: Parallel-fusion LSTM with synchronous semantic and visual information for image captioning
  publication-title: J. Vis. Commun. Image Represent.
  doi: 10.1016/j.jvcir.2021.103044
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0034
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0074
  article-title: End-to-end transformer based model for image captioning
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0020
  article-title: Show, attend and tell: neural image caption generation with visual attention
– year: 2018
  ident: 10.1016/j.image.2025.117273_bib0023
  article-title: Bottom-up and top-down attention for image captioning and visual question answering
– year: 2017
  ident: 10.1016/j.image.2025.117273_bib0022
  article-title: Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0033
  article-title: Rstnet: captioning with adaptive attention on visual and non-visual words
– volume: 60
  start-page: 84
  issue: 6
  year: 2017
  ident: 10.1016/j.image.2025.117273_bib0015
  article-title: Imagenet classification with deep convolutional neural networks
  publication-title: Commun ACM
  doi: 10.1145/3065386
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0035
  article-title: Training data-efficient image transformers & distillation through attention
– volume: 9
  start-page: 1735
  issue: 8
  year: 1997
  ident: 10.1016/j.image.2025.117273_bib0001
  article-title: Long short-term memory
  publication-title: Neural Comput.
  doi: 10.1162/neco.1997.9.8.1735
– volume: 519
  start-page: 69
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0046
  article-title: MAENet: a novel multi-head association attention enhancement network for completing intra-modal interaction in image captioning
  publication-title: Neurocomputing.
  doi: 10.1016/j.neucom.2022.11.045
– year: 2018
  ident: 10.1016/j.image.2025.117273_bib0048
  article-title: Exploring visual relationship for image captioning
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0049
  article-title: Aligning linguistic words and visual semantic units for image captioning
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0017
– year: 2017
  ident: 10.1016/j.image.2025.117273_bib0018
  article-title: Self-critical sequence training for image captioning
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0028
  article-title: Normalized and geometry-aware self-attention network for image captioning
– volume: 17
  start-page: 1
  issue: 2
  year: 2021
  ident: 10.1016/j.image.2025.117273_bib0068
  article-title: Integrating scene semantic knowledge into image captioning
  publication-title: ACM Trans. Multimedia Comput. Commun. Appl. (TOMM)
  doi: 10.1145/3439734
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0026
  article-title: Entangled transformer for image captioning
– volume: 2
  start-page: 67
  year: 2014
  ident: 10.1016/j.image.2025.117273_bib0063
  article-title: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions
  publication-title: Trans. Assoc. Comput. Linguist.
  doi: 10.1162/tacl_a_00166
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0055
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0073
  article-title: Reformer: the relational transformer for image captioning
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0051
  article-title: Hierarchy parsing for image captioning
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0012
  article-title: CIDEr: consensus-based image description evaluation
– year: 2016
  ident: 10.1016/j.image.2025.117273_bib0005
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0036
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0016
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0025
  article-title: Auto-encoding scene graphs for image captioning
– volume: 201
  year: 2020
  ident: 10.1016/j.image.2025.117273_bib0067
  article-title: The synergy of double attention: combine sentence-level and word-level attention for image captioning
  publication-title: Comput. Vis. Image Underst.
  doi: 10.1016/j.cviu.2020.103068
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0037
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0039
– volume: 228
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0042
  article-title: Cross-domain multi-style merge for image captioning
  publication-title: Comput. Vis. Image Underst.
  doi: 10.1016/j.cviu.2022.103617
– year: 2018
  ident: 10.1016/j.image.2025.117273_bib0058
– volume: 128
  year: 2022
  ident: 10.1016/j.image.2025.117273_bib0070
  article-title: Position-guided transformer for image captioning
  publication-title: Image Vis. Comput.
  doi: 10.1016/j.imavis.2022.104575
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0072
  article-title: Improving image captioning by leveraging intra- and inter-layer global representation in transformer network
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0030
  article-title: X-linear attention networks for image captioning
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0056
  article-title: Long-term recurrent convolutional networks for visual recognition and description
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0059
  article-title: Unified vision-language pre-training for image captioning and VQA
– volume: 78
  year: 2021
  ident: 10.1016/j.image.2025.117273_bib0047
  article-title: Attention-guided image captioning with adaptive global and local feature fusion
  publication-title: J. Vis. Commun. Image Represent.
  doi: 10.1016/j.jvcir.2021.103138
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0008
  article-title: Microsoft COCO: common objects in context
– year: 2023
  ident: 10.1016/j.image.2025.117273_bib0006
  article-title: RelTR: relation transformer for scene graph generation
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
  doi: 10.1109/TPAMI.2023.3268066
– year: 2004
  ident: 10.1016/j.image.2025.117273_bib0011
  article-title: ROUGE: a package for automatic evaluation of summaries
– volume: 38
  start-page: 1
  issue: 5
  year: 2019
  ident: 10.1016/j.image.2025.117273_bib0061
  article-title: Dynamic graph CNN for learning on point clouds
  publication-title: ACM Trans. Graph. (TOG)
  doi: 10.1145/3326362
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0032
  article-title: In defense of grid features for visual question answering
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0062
  article-title: Deep visual-semantic alignments for generating image descriptions
– year: 2023
  ident: 10.1016/j.image.2025.117273_bib0066
– year: 2023
  ident: 10.1016/j.image.2025.117273_bib0041
  article-title: Cross on cross attention: deep fusion transformer for image captioning
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0065
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0040
  article-title: Oscar: object-semantics aligned pre-training for vision-language tasks
– ident: 10.1016/j.image.2025.117273_bib0027
– year: 2016
  ident: 10.1016/j.image.2025.117273_bib0007
  article-title: Deep residual learning for image recognition
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0038
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0019
– year: 2002
  ident: 10.1016/j.image.2025.117273_bib0009
  article-title: BLEU: a method for automatic evaluation of machine translation
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0014
  article-title: Show and tell: a neural image caption generator
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0029
  article-title: Attention on attention for image captioning
– volume: 623
  start-page: 812
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0045
  article-title: Label-attention transformer with geometrically coherent objects for image captioning
  publication-title: Inf. Sci.
  doi: 10.1016/j.ins.2022.12.018
– volume: 31
  start-page: 3920
  year: 2022
  ident: 10.1016/j.image.2025.117273_bib0069
  article-title: Visual cluster grounding for image captioning
  publication-title: IEEE Trans. Image Process.
  doi: 10.1109/TIP.2022.3177318
– volume: 129
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0053
  article-title: Modeling graph-structured contexts for image captioning
  publication-title: Image Vis. Comput.
  doi: 10.1016/j.imavis.2022.104591
– volume: 77
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0043
  article-title: Relational-convergent transformer for image captioning
  publication-title: Displays
  doi: 10.1016/j.displa.2023.102377
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0050
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0052
  article-title: Relational graph reasoning transformer for image captioning
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0031
  article-title: Meshed-memory transformer for image captioning
– volume: 7
  start-page: 598
  issue: 8
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0002
  article-title: Overcoming nonlinear dynamics in diabetic retinopathy classification: a robust AI-based model with chaotic swarm intelligence optimization and recurrent long short-term memory
  publication-title: Fractal Fract.
  doi: 10.3390/fractalfract7080598
– year: 2017
  ident: 10.1016/j.image.2025.117273_bib0021
  article-title: Knowing when to look: adaptive attention via a visual sentinel for image captioning
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0064
StartPage 117273
SubjectTerms Attention mechanism
Image captioning
Semantic Graph
Spatial graph
Title Graph-based image captioning with semantic and spatial features
URI https://dx.doi.org/10.1016/j.image.2025.117273
Volume 133