Image Captioning Using Scene Graph Generation

Bibliographic Details
Published in: 2025 International Conference on Wireless Communications, Signal Processing and Networking (WiSPNET), pp. 1-5
Main Authors: Shiny S, Abisheak A, Balaji P, Rohit K
Format: Conference Proceeding
Language: English
Published: IEEE, 20.03.2025

Summary: Understanding and generating natural language descriptions from images is a fundamental challenge in vision-language tasks within artificial intelligence. This paper introduces a novel image captioning framework that integrates scene graph generation to improve the semantic richness of generated captions. The proposed method employs the Relation Transformer (RelTR) model to extract structural representations from visual scenes in the form of subject-predicate-object triplets. A transformer-based captioning model then utilizes these structured scene graphs to produce fluent and contextually accurate captions. Experimental evaluations on the Visual Genome dataset demonstrate that our approach yields superior semantic coherence and captioning accuracy compared to traditional image-to-text models. The incorporation of relational scene understanding results in captions that are more contextually informed and descriptive.
DOI: 10.1109/WiSPNET64060.2025.11005338
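
The abstract describes a two-stage pipeline: a scene graph model (RelTR) first extracts subject-predicate-object triplets, and a transformer-based captioner then conditions on the linearized graph. The sketch below illustrates only that data flow; the `extract_triplets` interface and the template decoder are hypothetical placeholders standing in for the actual RelTR and captioning models, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Triplet:
    """One subject-predicate-object relation from a scene graph."""
    subject: str
    predicate: str
    obj: str

def extract_triplets(image_path: str) -> List[Triplet]:
    """Hypothetical stand-in for RelTR inference: in the paper's
    pipeline this would run the scene graph model on the image.
    A fixed example is returned here for illustration only."""
    return [
        Triplet("man", "riding", "horse"),
        Triplet("horse", "on", "beach"),
    ]

def linearize(triplets: List[Triplet]) -> str:
    """Flatten the scene graph into a token sequence that a
    transformer captioning model could consume as conditioning input."""
    return " ; ".join(f"{t.subject} {t.predicate} {t.obj}" for t in triplets)

def caption_from_scene_graph(image_path: str) -> str:
    graph_text = linearize(extract_triplets(image_path))
    # A trained transformer decoder would generate the caption from the
    # linearized graph; a trivial template stands in for it here.
    return f"A scene showing {graph_text}."

print(caption_from_scene_graph("example.jpg"))
# -> A scene showing man riding horse ; horse on beach.
```

Linearizing the triplets into a token sequence is one common way to feed graph structure to a standard sequence-to-sequence transformer; the paper's exact conditioning mechanism may differ.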