Image Captioning Using Scene Graph Generation

Bibliographic Details
Published in: 2025 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 1-5
Main Authors: Shiny S, Abisheak A, Balaji P, Rohit K
Format: Conference Proceeding
Language: English
Published: IEEE, 20.03.2025

Summary: Understanding and generating natural language descriptions from images is a fundamental challenge in vision-language tasks within artificial intelligence. This paper introduces a novel image captioning framework that integrates scene graph generation to improve the semantic richness of generated captions. The proposed method employs the Relation Transformer (RelTR) model to extract structural representations from visual scenes in the form of subject-predicate-object triplets. A transformer-based captioning model then utilizes these structured scene graphs to produce fluent and contextually accurate captions. Experimental evaluations on the Visual Genome dataset demonstrate that our approach yields superior semantic coherence and captioning accuracy compared to traditional image-to-text models. The incorporation of relational scene understanding results in captions that are more contextually informed and descriptive.
DOI: 10.1109/WiSPNET64060.2025.11005338
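
The abstract describes a two-stage pipeline: a scene graph model (RelTR) first extracts subject-predicate-object triplets, and a transformer-based captioner then conditions on the linearized graph. The sketch below illustrates only that data flow; the `extract_triplets` interface and the template decoder are hypothetical placeholders standing in for the actual RelTR and captioning models, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Triplet:
    """One subject-predicate-object relation from a scene graph."""
    subject: str
    predicate: str
    obj: str

def extract_triplets(image_path: str) -> List[Triplet]:
    """Hypothetical stand-in for RelTR inference: in the paper's
    pipeline this would run the scene graph model on the image.
    A fixed example is returned here for illustration only."""
    return [
        Triplet("man", "riding", "horse"),
        Triplet("horse", "on", "beach"),
    ]

def linearize(triplets: List[Triplet]) -> str:
    """Flatten the scene graph into a token sequence that a
    transformer captioning model could consume as conditioning input."""
    return " ; ".join(f"{t.subject} {t.predicate} {t.obj}" for t in triplets)

def caption_from_scene_graph(image_path: str) -> str:
    graph_text = linearize(extract_triplets(image_path))
    # A trained transformer decoder would generate the caption from the
    # linearized graph; a trivial template stands in for it here.
    return f"A scene showing {graph_text}."

print(caption_from_scene_graph("example.jpg"))
# -> A scene showing man riding horse ; horse on beach.
```

Linearizing the triplets into a token sequence is one common way to feed graph structure to a standard sequence-to-sequence transformer; the paper's exact conditioning mechanism may differ.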