Graph-based image captioning with semantic and spatial features
Published in: Signal Processing: Image Communication, Vol. 133, p. 117273
Main Authors:
Format: Journal Article
Language: English
Published: Elsevier B.V., 01.04.2025
Subjects:
Summary:
•Objective: Enrich an image captioning model by leveraging spatial and semantic relations among objects, alongside standard visual features, to produce context-rich and accurate captions.
•Key Methodology:
1. Employs RelTR to extract object bounding boxes and subject-predicate-object relationship triplets.
2. Constructs spatial and semantic graphs from these relations and extracts contextual features from them with Graph Convolutional Networks (a sketch of this step follows the summary).
3. An LSTM decoder fuses the CNN visual features, the graph-based features, and the word embeddings through a multi-modal attention mechanism.
•Results: The method is competitive with state-of-the-art approaches and produces contextually aware, accurate descriptions that draw on richer contextual information.
•Impact: The methodology supports applications in automatic captioning, scene interpretation, and assistive technology.
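To make step 2 of the methodology concrete, the following is a minimal sketch, assuming RelTR-style outputs (object bounding boxes plus subject-predicate-object triplets). It is not the authors' code: the names (GCNLayer, build_graphs) are illustrative, and the box-overlap rule used for the spatial graph is a stand-in for whatever spatial criterion the paper actually adopts.

```python
# Illustrative sketch (not the paper's implementation): build semantic and
# spatial graphs from relation-detector outputs and encode them with a simple
# graph convolution. All names and the spatial-edge rule are assumptions.
import torch
import torch.nn as nn


class GCNLayer(nn.Module):
    """One graph convolution: mean-aggregate neighbor features via the adjacency matrix."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, in_dim); adj: (N, N) binary adjacency
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        agg = (adj @ node_feats) / deg            # mean over connected neighbors
        return torch.relu(self.linear(agg))


def build_graphs(boxes, triplets, num_nodes):
    """Build adjacency matrices for the semantic and spatial graphs.

    boxes:    (N, 4) object bounding boxes as (x1, y1, x2, y2)
    triplets: list of (subj_idx, predicate_id, obj_idx) from the relation detector
    """
    # Semantic graph: connect two objects whenever a predicate links them.
    semantic_adj = torch.zeros(num_nodes, num_nodes)
    for s, _, o in triplets:
        semantic_adj[s, o] = semantic_adj[o, s] = 1.0

    # Spatial graph: here, connect objects whose boxes overlap (assumed rule).
    spatial_adj = torch.zeros(num_nodes, num_nodes)
    for i in range(num_nodes):
        for j in range(i + 1, num_nodes):
            ix1 = torch.max(boxes[i, 0], boxes[j, 0])
            iy1 = torch.max(boxes[i, 1], boxes[j, 1])
            ix2 = torch.min(boxes[i, 2], boxes[j, 2])
            iy2 = torch.min(boxes[i, 3], boxes[j, 3])
            if ix2 > ix1 and iy2 > iy1:
                spatial_adj[i, j] = spatial_adj[j, i] = 1.0
    return semantic_adj, spatial_adj


# Example usage: encode per-object features (e.g., RoI-pooled CNN vectors)
# with one GCN layer per graph.
# gcn_sem, gcn_spa = GCNLayer(2048, 512), GCNLayer(2048, 512)
# sem_nodes = gcn_sem(object_feats, semantic_adj)
# spa_nodes = gcn_spa(object_feats, spatial_adj)
```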
Image captioning is a challenging image processing task that aims to generate descriptive and accurate textual descriptions for images. In this paper, we propose a novel image captioning framework that leverages the power of spatial and semantic relationships between objects in an image, in addition to traditional visual features. Our approach integrates a pre-trained model, RelTR, as a backbone for extracting object bounding boxes and subject-predicate-object relationship pairs. We use these extracted relationships to construct spatial and semantic graphs, which are processed through separate Graph Convolutional Networks (GCNs) to obtain high-level contextualized features. At the same time, a CNN model is employed to extract visual features from the input image. To fuse the feature vectors, our approach applies a multi-modal attention mechanism separately to the feature maps of the image, the nodes of the semantic graph, and the nodes of the spatial graph at each time step of the LSTM-based decoder. The model concatenates the attended features with the word embedding at the respective time step and feeds the result into the LSTM cell. Our experiments demonstrate the effectiveness of the proposed approach, which competes closely with existing state-of-the-art image captioning techniques by capturing richer contextual information and generating accurate and semantically meaningful captions.
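The decoding procedure described in the abstract can be illustrated with a short sketch of a single time step. This is a minimal sketch under several assumptions: additive (Bahdanau-style) attention, graph node features projected to the same dimensionality as the image features, and an nn.LSTMCell; the paper may use a different attention formulation, and the module names (AdditiveAttention, DecoderStep) are hypothetical.

```python
# Illustrative single decoding step (not the authors' code): attend separately
# to CNN feature maps and the two sets of GCN node features, concatenate the
# attended contexts with the current word embedding, and feed an LSTM cell.
import torch
import torch.nn as nn


class AdditiveAttention(nn.Module):
    def __init__(self, feat_dim, hidden_dim, attn_dim):
        super().__init__()
        self.proj_feat = nn.Linear(feat_dim, attn_dim)
        self.proj_hidden = nn.Linear(hidden_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, feats, hidden):
        # feats: (N, feat_dim) regions or graph nodes; hidden: (hidden_dim,)
        e = self.score(torch.tanh(self.proj_feat(feats) + self.proj_hidden(hidden)))
        alpha = torch.softmax(e.squeeze(-1), dim=0)       # weights over regions/nodes
        return (alpha.unsqueeze(-1) * feats).sum(dim=0)   # attended context vector


class DecoderStep(nn.Module):
    def __init__(self, feat_dim, embed_dim, hidden_dim, vocab_size, attn_dim=256):
        super().__init__()
        # One attention module per modality: image regions, semantic graph, spatial graph.
        self.attn_img = AdditiveAttention(feat_dim, hidden_dim, attn_dim)
        self.attn_sem = AdditiveAttention(feat_dim, hidden_dim, attn_dim)
        self.attn_spa = AdditiveAttention(feat_dim, hidden_dim, attn_dim)
        self.lstm = nn.LSTMCell(3 * feat_dim + embed_dim, hidden_dim)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, img_feats, sem_nodes, spa_nodes, word_emb, state):
        h, c = state
        # Concatenate the three attended contexts with the current word embedding.
        ctx = torch.cat([
            self.attn_img(img_feats, h),
            self.attn_sem(sem_nodes, h),
            self.attn_spa(spa_nodes, h),
            word_emb,
        ], dim=-1)
        h, c = self.lstm(ctx.unsqueeze(0), (h.unsqueeze(0), c.unsqueeze(0)))
        h, c = h.squeeze(0), c.squeeze(0)
        return self.out(h), (h, c)            # word logits and updated LSTM state
```

At inference, this step would be applied repeatedly, feeding back the embedding of the previously generated word until an end-of-sentence token is produced.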
© 2025 Elsevier Inc. All rights reserved.
ISSN: 0923-5965
DOI: 10.1016/j.image.2025.117273