Graph-based image captioning with semantic and spatial features



Bibliographic Details
Published in: Signal Processing: Image Communication, Vol. 133, Article 117273
Main authors: Parseh, Mohammad Javad; Ghadiri, Saeed
Format: Journal Article
Language: English
Published by: Elsevier B.V., 1 April 2025

Abstract
• Objective: To enrich an image captioning model by leveraging spatial and semantic relations among objects, alongside standard visual features, to produce context-rich and accurate captions.
• Key methodology: (1) employs RelTR to extract object bounding boxes and subject-predicate-object relationships; (2) constructs spatial and semantic graphs and extracts contextual features from them using Graph Convolutional Networks; (3) an LSTM decoder incorporates the CNN visual features, the graph-based features, and the word embeddings via a multi-modal attention mechanism.
• Results: The method is competitive with state-of-the-art approaches and yields contextually aware, accurate descriptions that draw on richer contextual information.
• Impact: Enables applications in automatic captioning, scene interpretation, and assistive technology.

Image captioning is a challenging image-processing task that aims to generate descriptive, accurate textual descriptions of images. In this paper, we propose a novel image captioning framework that exploits the spatial and semantic relationships between objects in an image in addition to traditional visual features. Our approach uses the pre-trained RelTR model as a backbone for extracting object bounding boxes and subject-predicate-object relationship pairs. From these extracted relationships we construct spatial and semantic graphs, which are processed by separate Graph Convolutional Networks (GCNs) to obtain high-level contextualized features. In parallel, a CNN extracts visual features from the input image. To fuse the feature vectors, a multi-modal attention mechanism is applied separately to the image feature maps, the semantic-graph nodes, and the spatial-graph nodes at each time step of the LSTM-based decoder. The model concatenates the attended features with the word embedding at the corresponding time step and feeds the result into the LSTM cell. Our experiments demonstrate the effectiveness of the proposed approach, which competes closely with existing state-of-the-art image captioning techniques, capturing richer contextual information and generating accurate, semantically meaningful captions. © 2025 Elsevier Inc. All rights reserved.
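The pipeline the abstract describes can be illustrated with a minimal, hypothetical sketch: build a graph from subject-predicate-object triples, apply one GCN propagation step, attend over the contextualized node features, and concatenate the attended context with a word embedding as decoder input. All dimensions, the example triples, and the `attend` helper are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# --- 1. Semantic graph from RelTR-style triples (node indices are objects) ---
triples = [(0, "holding", 1), (1, "on", 2)]    # e.g. person-holding-cup, cup-on-table
n_nodes, d = 3, 8
A = np.eye(n_nodes)                            # adjacency with self-loops
for s, _, o in triples:
    A[s, o] = A[o, s] = 1.0                    # one undirected edge per relation

# --- 2. One GCN layer: H' = ReLU(D^-1/2 A D^-1/2 H W) ---
H = rng.standard_normal((n_nodes, d))          # initial node features
W = rng.standard_normal((d, d))                # learnable weights (random here)
D_inv_sqrt = np.diag(1.0 / np.sqrt(A.sum(axis=1)))
H_ctx = np.maximum(D_inv_sqrt @ A @ D_inv_sqrt @ H @ W, 0.0)

# --- 3. Attention over node features, conditioned on the decoder state ---
def attend(feats, h):
    scores = feats @ h                         # (n_nodes,) dot-product scores
    w = np.exp(scores - scores.max())
    w /= w.sum()                               # softmax weights
    return w @ feats                           # weighted sum -> context vector

h_dec = rng.standard_normal(d)                 # previous LSTM hidden state
ctx_graph = attend(H_ctx, h_dec)               # attended graph context

# --- 4. Decoder input: attended context concatenated with the word embedding ---
w_emb = rng.standard_normal(d)
x_t = np.concatenate([ctx_graph, w_emb])       # this vector feeds the LSTM cell
print(x_t.shape)                               # (16,)
```

In the full model this attention step would run separately over the image feature maps, the semantic-graph nodes, and the spatial-graph nodes at every decoding time step, with all attended vectors concatenated before the LSTM cell.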
Author details:
– Parseh, Mohammad Javad (ORCID: 0000-0003-0109-3133; email: parseh@jahromu.ac.ir)
– Ghadiri, Saeed
Copyright: © 2025 Elsevier B.V.
DOI: 10.1016/j.image.2025.117273
Disciplines: Applied Sciences; Engineering; Computer Science
ISSN 0923-5965
IsPeerReviewed true
IsScholarly true
Keywords: Spatial graph; Attention mechanism; Image captioning; Semantic graph
References Simonyan, Zisserman (bib0017) 2014
Hochreiter, Schmidhuber (bib0001) 1997; 9
Zhang (bib0033) 2021
Wei (bib0067) 2020; 201
Guo (bib0028) 2020
Kingma, Ba (bib0064) 2014
Dubey (bib0045) 2023; 623
Sasibhooshan, Kumaraswamy, Sasidharan (bib0071) 2023; 10
Kipf, Welling (bib0005) 2016
Wang (bib0061) 2019; 38
Krizhevsky, Sutskever, Hinton (bib0015) 2017; 60
Donahue (bib0056) 2015
Gao (bib0075) 2019
Shen (bib0037) 2021
Karpathy, Fei-Fei (bib0062) 2015
Herdade, S., et al., Image captioning: transforming objects into words. Advances in neural information processing systems, 2019. 32.
Chen (bib0043) 2023; 77
Young (bib0063) 2014; 2
Banerjee, Lavie (bib0010) 2005
Duan (bib0042) 2023; 228
Touvron (bib0035) 2021
He (bib0007) 2016
Cornia (bib0039) 2021
Radford, A., et al., Improving language understanding by generative pre-training. 2018.
Huang (bib0029) 2019
Yao (bib0051) 2019
Wang, Gu (bib0044) 2023; 211
Wang, Xu, Sun (bib0074) 2022
Zhong (bib0047) 2021; 78
Zhang, Li, Wang (bib0060) 2021; 75
Li (bib0040) 2020
Özçelik, Altan (bib0003) 2023
Cong, Yang, Rosenhahn (bib0006) 2023
Wei (bib0068) 2021; 17
Hu (bib0046) 2023; 519
Zhang (bib0041) 2023
Yağ, Altan (bib0004) 2022; 11
Anderson (bib0013) 2016
Mao (bib0016) 2014
Anderson (bib0023) 2018
Ge (bib0054) 2019
Papineni (bib0009) 2002
Dosovitskiy (bib0034) 2020
Chen (bib0022) 2017
Liu (bib0036) 2021
ROUGE (bib0011) 2004
Lu (bib0021) 2017
Yang (bib0025) 2019
Hu (bib0070) 2022; 128
Zhou (bib0059) 2020
Jiang (bib0069) 2022; 31
Guo (bib0049) 2019
Bahdanau, Cho, Bengio (bib0019) 2014
Devlin (bib0058) 2018
Vinyals (bib0014) 2015
Li (bib0026) 2019
Jiang (bib0032) 2020
Moral (bib0065) 2022
Xu (bib0020) 2015
Lin (bib0008) 2014
Xiao (bib0052) 2022
Abedi, Karshenas, Adibi (bib0066) 2023
Shi (bib0050) 2020
Vaswani (bib0024) 2017
Yao (bib0048) 2018
Vedantam, Zitnick, Parikh (bib0012) 2015
Özçelik, Altan (bib0002) 2023; 7
Cornia (bib0031) 2020
Zhang (bib0055) 2022
Yang, Liu, Wang (bib0073) 2022
Rennie (bib0018) 2017
Mokady, Hertz, Bermano (bib0038) 2021
Ji (bib0072) 2021
Pan (bib0030) 2020
Li (bib0053) 2023; 129
Li (10.1016/j.image.2025.117273_bib0053) 2023; 129
Sasibhooshan (10.1016/j.image.2025.117273_bib0071) 2023; 10
Papineni (10.1016/j.image.2025.117273_bib0009) 2002
Krizhevsky (10.1016/j.image.2025.117273_bib0015) 2017; 60
Donahue (10.1016/j.image.2025.117273_bib0056) 2015
Wei (10.1016/j.image.2025.117273_bib0068) 2021; 17
Gao (10.1016/j.image.2025.117273_bib0075) 2019
Özçelik (10.1016/j.image.2025.117273_bib0002) 2023; 7
Cong (10.1016/j.image.2025.117273_bib0006) 2023
Yang (10.1016/j.image.2025.117273_bib0025) 2019
Ge (10.1016/j.image.2025.117273_bib0054) 2019
He (10.1016/j.image.2025.117273_bib0007) 2016
Dosovitskiy (10.1016/j.image.2025.117273_bib0034) 2020
Liu (10.1016/j.image.2025.117273_bib0036) 2021
Cornia (10.1016/j.image.2025.117273_bib0031) 2020
Yağ (10.1016/j.image.2025.117273_bib0004) 2022; 11
Guo (10.1016/j.image.2025.117273_bib0049) 2019
Wang (10.1016/j.image.2025.117273_bib0044) 2023; 211
Hochreiter (10.1016/j.image.2025.117273_bib0001) 1997; 9
ROUGE (10.1016/j.image.2025.117273_bib0011) 2004
Zhang (10.1016/j.image.2025.117273_bib0041) 2023
Zhou (10.1016/j.image.2025.117273_bib0059) 2020
Xu (10.1016/j.image.2025.117273_bib0020) 2015
Shi (10.1016/j.image.2025.117273_bib0050) 2020
Vaswani (10.1016/j.image.2025.117273_bib0024) 2017
Duan (10.1016/j.image.2025.117273_bib0042) 2023; 228
Jiang (10.1016/j.image.2025.117273_bib0069) 2022; 31
Kingma (10.1016/j.image.2025.117273_bib0064) 2014
Xiao (10.1016/j.image.2025.117273_bib0052) 2022
Yao (10.1016/j.image.2025.117273_bib0051) 2019
Lin (10.1016/j.image.2025.117273_bib0008) 2014
Bahdanau (10.1016/j.image.2025.117273_bib0019) 2014
Özçelik (10.1016/j.image.2025.117273_bib0003) 2023
Hu (10.1016/j.image.2025.117273_bib0070) 2022; 128
Guo (10.1016/j.image.2025.117273_bib0028) 2020
Wang (10.1016/j.image.2025.117273_bib0061) 2019; 38
Lu (10.1016/j.image.2025.117273_bib0021) 2017
Kipf (10.1016/j.image.2025.117273_bib0005) 2016
Moral (10.1016/j.image.2025.117273_bib0065) 2022
Chen (10.1016/j.image.2025.117273_bib0022) 2017
Zhang (10.1016/j.image.2025.117273_bib0055) 2022
Ji (10.1016/j.image.2025.117273_bib0072) 2021
Anderson (10.1016/j.image.2025.117273_bib0013) 2016
Chen (10.1016/j.image.2025.117273_bib0043) 2023; 77
Dubey (10.1016/j.image.2025.117273_bib0045) 2023; 623
Devlin (10.1016/j.image.2025.117273_bib0058) 2018
10.1016/j.image.2025.117273_bib0057
Anderson (10.1016/j.image.2025.117273_bib0023) 2018
Pan (10.1016/j.image.2025.117273_bib0030) 2020
Wei (10.1016/j.image.2025.117273_bib0067) 2020; 201
Vedantam (10.1016/j.image.2025.117273_bib0012) 2015
Simonyan (10.1016/j.image.2025.117273_bib0017) 2014
Vinyals (10.1016/j.image.2025.117273_bib0014) 2015
Jiang (10.1016/j.image.2025.117273_bib0032) 2020
Li (10.1016/j.image.2025.117273_bib0040) 2020
Zhong (10.1016/j.image.2025.117273_bib0047) 2021; 78
Banerjee (10.1016/j.image.2025.117273_bib0010) 2005
Abedi (10.1016/j.image.2025.117273_bib0066) 2023
Yang (10.1016/j.image.2025.117273_bib0073) 2022
Young (10.1016/j.image.2025.117273_bib0063) 2014; 2
Hu (10.1016/j.image.2025.117273_bib0046) 2023; 519
Yao (10.1016/j.image.2025.117273_bib0048) 2018
Karpathy (10.1016/j.image.2025.117273_bib0062) 2015
Wang (10.1016/j.image.2025.117273_bib0074) 2022
Rennie (10.1016/j.image.2025.117273_bib0018) 2017
Li (10.1016/j.image.2025.117273_bib0026) 2019
Huang (10.1016/j.image.2025.117273_bib0029) 2019
Cornia (10.1016/j.image.2025.117273_bib0039) 2021
Zhang (10.1016/j.image.2025.117273_bib0033) 2021
Shen (10.1016/j.image.2025.117273_bib0037) 2021
Zhang (10.1016/j.image.2025.117273_bib0060) 2021; 75
Mokady (10.1016/j.image.2025.117273_bib0038) 2021
Mao (10.1016/j.image.2025.117273_bib0016) 2014
10.1016/j.image.2025.117273_bib0027
Touvron (10.1016/j.image.2025.117273_bib0035) 2021
References_xml – year: 2016
  ident: bib0007
  article-title: Deep residual learning for image recognition
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2021
  ident: bib0072
  article-title: Improving image captioning by leveraging intra-and inter-layer global representation in transformer network
  publication-title: Proceedings of the AAAI conference on artificial intelligence
– year: 2020
  ident: bib0059
  article-title: Unified vision-language pre-training for image captioning and vqa
  publication-title: Proceedings of the AAAI conference on artificial intelligence
– year: 2023
  ident: bib0003
  article-title: Classification of diabetic retinopathy by machine learning algorithm using entorpy-based features
  publication-title: Proceedings of the ÇAnkaya International Congress on Scientific Research
– year: 2002
  ident: bib0009
  article-title: Bleu: a method for automatic evaluation of machine translation
  publication-title: Proceedings of the 40th annual meeting of the Association for Computational Linguistics
– year: 2014
  ident: bib0017
  article-title: arXiv preprint
– year: 2023
  ident: bib0066
  article-title: arXiv preprint
– year: 2014
  ident: bib0019
  article-title: arXiv preprint
– year: 2020
  ident: bib0028
  article-title: Normalized and geometry-aware self-attention network for image captioning
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2021
  ident: bib0035
  article-title: Training data-efficient image transformers & distillation through attention
  publication-title: International conference on machine learning
– volume: 201
  year: 2020
  ident: bib0067
  article-title: The synergy of double attention: combine sentence-level and word-level attention for image captioning
  publication-title: Computer Vis. Image Underst.
– volume: 623
  start-page: 812
  year: 2023
  end-page: 831
  ident: bib0045
  article-title: Label-attention transformer with geometrically coherent objects for image captioning
  publication-title: Inf Sci (Ny)
– year: 2014
  ident: bib0016
  article-title: arXiv preprint
– year: 2017
  ident: bib0018
  article-title: Self-critical sequence training for image captioning
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2015
  ident: bib0062
  article-title: Deep visual-semantic alignments for generating image descriptions
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– start-page: 30
  year: 2017
  ident: bib0024
  article-title: Attention is all you need
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2020
  ident: bib0030
  article-title: X-linear attention networks for image captioning
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2005
  ident: bib0010
  article-title: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments
  publication-title: Proceedings of the acl workshop on intrinsic and extrinsic evaluation measures for machine translation and/or summarization
– reference: Herdade, S., et al., Image captioning: transforming objects into words. Advances in neural information processing systems, 2019. 32.
– year: 2018
  ident: bib0048
  article-title: Exploring visual relationship for image captioning
  publication-title: Proceedings of the European conference on computer vision (ECCV)
– year: 2023
  ident: bib0041
  article-title: Cross on cross attention: deep fusion transformer for image captioning
  publication-title: IEEE Transactions on Circuits and Systems for Video Technology
– year: 2014
  ident: bib0008
  article-title: Microsoft coco: common objects in context
  publication-title: Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part V 13
– year: 2014
  ident: bib0064
  article-title: arXiv preprint
– volume: 519
  start-page: 69
  year: 2023
  end-page: 81
  ident: bib0046
  article-title: MAENet: a novel multi-head association attention enhancement network for completing intra-modal interaction in image captioning
  publication-title: Neurocomputing.
– volume: 11
  start-page: 1732
  year: 2022
  ident: bib0004
  article-title: Artificial intelligence-based robust hybrid algorithm design and implementation for real-time detection of plant diseases in agricultural environments
  publication-title: Biology.
– year: 2017
  ident: bib0021
  article-title: Knowing when to look: adaptive attention via a visual sentinel for image captioning
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– volume: 75
  year: 2021
  ident: bib0060
  article-title: Parallel-fusion LSTM with synchronous semantic and visual information for image captioning
  publication-title: J. Vis. Commun. Image Represent.
– year: 2021
  ident: bib0039
  article-title: arXiv preprint
– year: 2018
  ident: bib0058
  article-title: arXiv preprint
– year: 2019
  ident: bib0029
  article-title: Attention on attention for image captioning
  publication-title: Proceedings of the IEEE/CVF international conference on computer vision
– volume: 17
  start-page: 1
  year: 2021
  end-page: 22
  ident: bib0068
  article-title: Integrating scene semantic knowledge into image captioning
  publication-title: ACM Trans. Multimedia Comput. Commun. Appl. (TOMM)
– volume: 129
  year: 2023
  ident: bib0053
  article-title: Modeling graph-structured contexts for image captioning
  publication-title: Image Vis. Comput.
– volume: 31
  start-page: 3920
  year: 2022
  end-page: 3934
  ident: bib0069
  article-title: Visual cluster grounding for image captioning
  publication-title: IEEE Trans. Image Proc.
– year: 2022
  ident: bib0052
  article-title: Relational Graph Reasoning Transformer for Image Captioning
  publication-title: 2022 IEEE International Conference on Multimedia and Expo (ICME)
– year: 2015
  ident: bib0056
  article-title: Long-term recurrent convolutional networks for visual recognition and description
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2019
  ident: bib0049
  article-title: Aligning linguistic words and visual semantic units for image captioning
  publication-title: Proceedings of the 27th ACM international conference on multimedia
– year: 2020
  ident: bib0032
  article-title: In defense of grid features for visual question answering
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2019
  ident: bib0051
  article-title: Hierarchy parsing for image captioning
  publication-title: Proceedings of the IEEE/CVF international conference on computer vision
– year: 2023
  ident: bib0006
  article-title: Reltr: relation transformer for scene graph generation
  publication-title: IEEe Trans. Pattern. Anal. Mach. Intell.
– year: 2015
  ident: bib0020
  article-title: Show, attend and tell: neural image caption generation with visual attention
  publication-title: International conference on machine learning
– volume: 228
  year: 2023
  ident: bib0042
  article-title: Cross-domain multi-style merge for image captioning
  publication-title: Computer Vision and Image Understanding
– year: 2015
  ident: bib0014
  article-title: Show and tell: a neural image caption generator
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– volume: 78
  year: 2021
  ident: bib0047
  article-title: Attention-guided image captioning with adaptive global and local feature fusion
  publication-title: J. Vis. Commun. Image Represent.
– volume: 60
  start-page: 84
  year: 2017
  end-page: 90
  ident: bib0015
  article-title: Imagenet classification with deep convolutional neural networks
  publication-title: Commun ACM
– year: 2019
  ident: bib0054
  article-title: Exploring overall contextual information for image captioning in human-like cognitive style
  publication-title: Proceedings of the IEEE/CVF International Conference on Computer Vision
– year: 2022
  ident: bib0065
  publication-title: Automated Image Captioning with Multi-layer Gated Recurrent Unit. in 2022 30th European Signal Processing Conference (EUSIPCO)
– volume: 77
  year: 2023
  ident: bib0043
  article-title: Relational-Convergent Transformer for image captioning
  publication-title: Displays
– volume: 38
  start-page: 1
  year: 2019
  end-page: 12
  ident: bib0061
  article-title: Dynamic graph cnn for learning on point clouds
  publication-title: Acm Trans. Graphics (tog)
– year: 2022
  ident: bib0074
  article-title: End-to-end transformer based model for image captioning
  publication-title: Proceedings of the AAAI Conference on Artificial Intelligence
– year: 2017
  ident: bib0022
  article-title: Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2018
  ident: bib0023
  article-title: Bottom-up and top-down attention for image captioning and visual question answering
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– year: 2020
  ident: bib0031
  article-title: Meshed-memory transformer for image captioning
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2020
  ident: bib0034
  article-title: arXiv preprint
– year: 2020
  ident: bib0040
  article-title: Oscar: object-semantics aligned pre-training for vision-language tasks
  publication-title: Computer Vision–ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXX 16
– volume: 9
  start-page: 1735
  year: 1997
  end-page: 1780
  ident: bib0001
  article-title: Long short-term memory
  publication-title: Neural Comput.
– reference: Radford, A., et al., Improving language understanding by generative pre-training. 2018.
– volume: 7
  start-page: 598
  year: 2023
  ident: bib0002
  article-title: Overcoming nonlinear dynamics in diabetic retinopathy classification: a robust AI-based model with chaotic swarm intelligence optimization and recurrent long short-term memory
  publication-title: Fract. Fraction.
– year: 2022
  ident: bib0055
  article-title: Image Caption Generation Using Contextual Information Fusion With Bi-LSTM-s
– year: 2021
  ident: bib0036
  article-title: arXiv preprint
– year: 2019
  ident: bib0026
  article-title: Entangled transformer for image captioning
  publication-title: Proceedings of the IEEE/CVF international conference on computer vision
– year: 2021
  ident: bib0037
  article-title: arXiv preprint
– year: 2019
  ident: bib0025
  article-title: Auto-encoding scene graphs for image captioning
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– volume: 2
  start-page: 67
  year: 2014
  end-page: 78
  ident: bib0063
  article-title: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions
  publication-title: Trans. Assoc. Comput. Linguist.
– year: 2004
  ident: bib0011
  article-title: A package for automatic evaluation of summaries
  publication-title: Proceedings of Workshop on Text Summarization of ACL
– year: 2015
  ident: bib0012
  article-title: Cider: consensus-based image description evaluation
  publication-title: Proceedings of the IEEE conference on computer vision and pattern recognition
– volume: 10
  start-page: 18
  year: 2023
  ident: bib0071
  article-title: Image caption generation using visual attention prediction and contextual spatial relation extraction
  publication-title: J. Big. Data
– year: 2019
  ident: bib0075
  article-title: Deliberate attention networks for image captioning
  publication-title: Proceedings of the AAAI conference on artificial intelligence
– year: 2016
  ident: bib0005
  article-title: arXiv preprint
– year: 2021
  ident: bib0038
  article-title: arXiv preprint
– volume: 211
  year: 2023
  ident: bib0044
  article-title: Learning joint relationship attention network for image captioning
  publication-title: Expert. Syst. Appl.
– year: 2020
  ident: bib0050
  article-title: arXiv preprint
– year: 2021
  ident: bib0033
  article-title: Rstnet: captioning with adaptive attention on visual and non-visual words
  publication-title: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition
– year: 2016
  ident: bib0013
  article-title: Spice: semantic propositional image caption evaluation
  publication-title: Computer Vision–ECCV 2016: 14th European Conference, Amsterdam, The Netherlands, October 11-14, 2016, Proceedings, Part V 14
– year: 2022
  ident: bib0073
  article-title: Reformer: the relational transformer for image captioning
  publication-title: Proceedings of the 30th ACM International Conference on Multimedia
– volume: 128
  year: 2022
  ident: bib0070
  article-title: Position-guided transformer for image captioning
  publication-title: Image Vis. Comput.
– ident: 10.1016/j.image.2025.117273_bib0057
– start-page: 30
  year: 2017
  ident: 10.1016/j.image.2025.117273_bib0024
  article-title: Attention is all you need
  publication-title: Adv. Neural Inf. Process. Syst.
– year: 2023
  ident: 10.1016/j.image.2025.117273_bib0003
  article-title: Classification of diabetic retinopathy by machine learning algorithm using entorpy-based features
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0054
  article-title: Exploring overall contextual information for image captioning in human-like cognitive style
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0075
  article-title: Deliberate attention networks for image captioning
– year: 2005
  ident: 10.1016/j.image.2025.117273_bib0010
  article-title: METEOR: an automatic metric for MT evaluation with improved correlation with human judgments
– year: 2016
  ident: 10.1016/j.image.2025.117273_bib0013
  article-title: Spice: semantic propositional image caption evaluation
– volume: 211
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0044
  article-title: Learning joint relationship attention network for image captioning
  publication-title: Expert. Syst. Appl.
  doi: 10.1016/j.eswa.2022.118474
– volume: 11
  start-page: 1732
  issue: 12
  year: 2022
  ident: 10.1016/j.image.2025.117273_bib0004
  article-title: Artificial intelligence-based robust hybrid algorithm design and implementation for real-time detection of plant diseases in agricultural environments
  publication-title: Biology.
  doi: 10.3390/biology11121732
– volume: 10
  start-page: 18
  issue: 1
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0071
  article-title: Image caption generation using visual attention prediction and contextual spatial relation extraction
  publication-title: J. Big. Data
  doi: 10.1186/s40537-023-00693-9
– volume: 75
  year: 2021
  ident: 10.1016/j.image.2025.117273_bib0060
  article-title: Parallel-fusion LSTM with synchronous semantic and visual information for image captioning
  publication-title: J. Vis. Commun. Image Represent.
  doi: 10.1016/j.jvcir.2021.103044
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0034
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0074
  article-title: End-to-end transformer based model for image captioning
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0020
  article-title: Show, attend and tell: neural image caption generation with visual attention
– year: 2018
  ident: 10.1016/j.image.2025.117273_bib0023
  article-title: Bottom-up and top-down attention for image captioning and visual question answering
– year: 2017
  ident: 10.1016/j.image.2025.117273_bib0022
  article-title: Sca-cnn: spatial and channel-wise attention in convolutional networks for image captioning
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0033
  article-title: Rstnet: captioning with adaptive attention on visual and non-visual words
– volume: 60
  start-page: 84
  issue: 6
  year: 2017
  ident: 10.1016/j.image.2025.117273_bib0015
  article-title: Imagenet classification with deep convolutional neural networks
  publication-title: Commun ACM
  doi: 10.1145/3065386
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0035
  article-title: Training data-efficient image transformers & distillation through attention
– volume: 9
  start-page: 1735
  issue: 8
  year: 1997
  ident: 10.1016/j.image.2025.117273_bib0001
  article-title: Long short-term memory
  publication-title: Neural Comput.
  doi: 10.1162/neco.1997.9.8.1735
– volume: 519
  start-page: 69
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0046
  article-title: MAENet: a novel multi-head association attention enhancement network for completing intra-modal interaction in image captioning
  publication-title: Neurocomputing.
  doi: 10.1016/j.neucom.2022.11.045
– year: 2018
  ident: 10.1016/j.image.2025.117273_bib0048
  article-title: Exploring visual relationship for image captioning
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0049
  article-title: Aligning linguistic words and visual semantic units for image captioning
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0017
– year: 2017
  ident: 10.1016/j.image.2025.117273_bib0018
  article-title: Self-critical sequence training for image captioning
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0028
  article-title: Normalized and geometry-aware self-attention network for image captioning
– volume: 17
  start-page: 1
  issue: 2
  year: 2021
  ident: 10.1016/j.image.2025.117273_bib0068
  article-title: Integrating scene semantic knowledge into image captioning
  publication-title: ACM Trans. Multimedia Comput. Commun. Appl. (TOMM)
  doi: 10.1145/3439734
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0026
  article-title: Entangled transformer for image captioning
– volume: 2
  start-page: 67
  year: 2014
  ident: 10.1016/j.image.2025.117273_bib0063
  article-title: From image descriptions to visual denotations: new similarity metrics for semantic inference over event descriptions
  publication-title: Trans. Assoc. Comput. Linguist.
  doi: 10.1162/tacl_a_00166
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0055
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0073
  article-title: Reformer: the relational transformer for image captioning
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0051
  article-title: Hierarchy parsing for image captioning
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0012
  article-title: CIDEr: consensus-based image description evaluation
– year: 2016
  ident: 10.1016/j.image.2025.117273_bib0005
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0036
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0016
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0025
  article-title: Auto-encoding scene graphs for image captioning
– volume: 201
  year: 2020
  ident: 10.1016/j.image.2025.117273_bib0067
  article-title: The synergy of double attention: combine sentence-level and word-level attention for image captioning
  publication-title: Comput. Vis. Image Underst.
  doi: 10.1016/j.cviu.2020.103068
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0037
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0039
– volume: 228
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0042
  article-title: Cross-domain multi-style merge for image captioning
  publication-title: Comput. Vis. Image Underst.
  doi: 10.1016/j.cviu.2022.103617
– year: 2018
  ident: 10.1016/j.image.2025.117273_bib0058
– volume: 128
  year: 2022
  ident: 10.1016/j.image.2025.117273_bib0070
  article-title: Position-guided transformer for image captioning
  publication-title: Image Vis. Comput.
  doi: 10.1016/j.imavis.2022.104575
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0072
  article-title: Improving image captioning by leveraging intra- and inter-layer global representation in transformer network
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0030
  article-title: X-linear attention networks for image captioning
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0056
  article-title: Long-term recurrent convolutional networks for visual recognition and description
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0059
  article-title: Unified vision-language pre-training for image captioning and VQA
– volume: 78
  year: 2021
  ident: 10.1016/j.image.2025.117273_bib0047
  article-title: Attention-guided image captioning with adaptive global and local feature fusion
  publication-title: J. Vis. Commun. Image Represent.
  doi: 10.1016/j.jvcir.2021.103138
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0008
  article-title: Microsoft COCO: common objects in context
– year: 2023
  ident: 10.1016/j.image.2025.117273_bib0006
  article-title: RelTR: relation transformer for scene graph generation
  publication-title: IEEE Trans. Pattern Anal. Mach. Intell.
  doi: 10.1109/TPAMI.2023.3268066
– year: 2004
  ident: 10.1016/j.image.2025.117273_bib0011
  article-title: ROUGE: a package for automatic evaluation of summaries
– volume: 38
  start-page: 1
  issue: 5
  year: 2019
  ident: 10.1016/j.image.2025.117273_bib0061
  article-title: Dynamic graph CNN for learning on point clouds
  publication-title: ACM Trans. Graph. (TOG)
  doi: 10.1145/3326362
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0032
  article-title: In defense of grid features for visual question answering
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0062
  article-title: Deep visual-semantic alignments for generating image descriptions
– year: 2023
  ident: 10.1016/j.image.2025.117273_bib0066
– year: 2023
  ident: 10.1016/j.image.2025.117273_bib0041
  article-title: Cross on cross attention: deep fusion transformer for image captioning
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0065
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0040
  article-title: Oscar: object-semantics aligned pre-training for vision-language tasks
– ident: 10.1016/j.image.2025.117273_bib0027
– year: 2016
  ident: 10.1016/j.image.2025.117273_bib0007
  article-title: Deep residual learning for image recognition
– year: 2021
  ident: 10.1016/j.image.2025.117273_bib0038
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0019
– year: 2002
  ident: 10.1016/j.image.2025.117273_bib0009
  article-title: BLEU: a method for automatic evaluation of machine translation
– year: 2015
  ident: 10.1016/j.image.2025.117273_bib0014
  article-title: Show and tell: a neural image caption generator
– year: 2019
  ident: 10.1016/j.image.2025.117273_bib0029
  article-title: Attention on attention for image captioning
– volume: 623
  start-page: 812
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0045
  article-title: Label-attention transformer with geometrically coherent objects for image captioning
  publication-title: Inf. Sci.
  doi: 10.1016/j.ins.2022.12.018
– volume: 31
  start-page: 3920
  year: 2022
  ident: 10.1016/j.image.2025.117273_bib0069
  article-title: Visual cluster grounding for image captioning
  publication-title: IEEE Trans. Image Process.
  doi: 10.1109/TIP.2022.3177318
– volume: 129
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0053
  article-title: Modeling graph-structured contexts for image captioning
  publication-title: Image Vis. Comput.
  doi: 10.1016/j.imavis.2022.104591
– volume: 77
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0043
  article-title: Relational-convergent transformer for image captioning
  publication-title: Displays
  doi: 10.1016/j.displa.2023.102377
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0050
– year: 2022
  ident: 10.1016/j.image.2025.117273_bib0052
  article-title: Relational graph reasoning transformer for image captioning
– year: 2020
  ident: 10.1016/j.image.2025.117273_bib0031
  article-title: Meshed-memory transformer for image captioning
– volume: 7
  start-page: 598
  issue: 8
  year: 2023
  ident: 10.1016/j.image.2025.117273_bib0002
  article-title: Overcoming nonlinear dynamics in diabetic retinopathy classification: a robust AI-based model with chaotic swarm intelligence optimization and recurrent long short-term memory
  publication-title: Fractal Fract.
  doi: 10.3390/fractalfract7080598
– year: 2017
  ident: 10.1016/j.image.2025.117273_bib0021
  article-title: Knowing when to look: adaptive attention via a visual sentinel for image captioning
– year: 2014
  ident: 10.1016/j.image.2025.117273_bib0064
StartPage 117273
SubjectTerms Attention mechanism
Image captioning
Semantic Graph
Spatial graph
Title Graph-based image captioning with semantic and spatial features
URI https://dx.doi.org/10.1016/j.image.2025.117273
Volume 133