LLM-powered scene graph representation learning for image retrieval via visual triplet-based graph transformation
Published in | Expert Systems with Applications, Vol. 286, p. 127926
---|---
Main Authors |
Format | Journal Article
Language | English
Published | Elsevier Ltd, 15.08.2025
Summary:
• Image retrieval system leveraging LLM-powered high-level visual context.
• Converts a scene graph into a visual triplet-based graph with triplets as nodes.
• Graph embedding reflects the importance of visual triplets via an attention mechanism.
• VTGT achieves superior image retrieval performance compared to baselines.
A scene graph represents the relational information between objects within an image, conveying its inherent semantic content. Current image retrieval methods, which use an image as a query to find similar ones, typically rely on visual content or shallow structural similarities between scene graphs. These methods exploit only surface-level information, overlooking the high-level semantic information embedded in the scene graph. In this study, we leverage visual triplet units, consisting of subject-relation-object combinations in the scene graph, to capture high-level semantics more effectively. To enrich the triplets, we incorporate extensive knowledge from large language models (LLMs). We propose Visual Triplet-based Graph Transformation (VTGT), a framework that transforms the scene graph into a visual triplet-based graph in which the triplets serve as the nodes. This transformed graph is then processed by a graph neural network (GNN) to learn an optimal scene graph representation. Experimental results on image retrieval demonstrate the superior performance of our approach, driven by the LLM-powered visual triplet-based graph representation.
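To make the two ideas in the abstract concrete, here is a minimal Python sketch, not the authors' implementation: it converts a toy set of subject-relation-object triplets into a triplet-based graph (triplets as nodes, edges between triplets that share an entity) and pools stand-in node embeddings with a simple attention weighting, so more important triplets contribute more to the graph embedding. The entity-sharing rule, the function names, and the embeddings are all assumptions made for illustration; LLM enrichment and the GNN itself are elided.

```python
import math
from itertools import combinations

# Toy scene graph for one image: subject-relation-object triplets.
# (In the paper these would come from a scene-graph generator and
# be enriched with LLM-provided context; both steps are elided.)
triplets = [
    ("man", "riding", "horse"),
    ("man", "wearing", "hat"),
    ("horse", "standing on", "grass"),
]

def to_triplet_graph(triplets):
    """Each triplet becomes a node; two nodes are linked when their
    triplets share an entity (subject or object). The sharing rule
    is an assumption, chosen only to illustrate the transformation."""
    edges = [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(triplets), 2)
        if {a[0], a[2]} & {b[0], b[2]}  # shared subject/object entity
    ]
    return list(triplets), edges

def attention_pool(node_embs, query):
    """Softmax-weighted sum of node embeddings: triplets that score
    higher against the query vector weigh more in the graph embedding."""
    scores = [sum(q * x for q, x in zip(query, e)) for e in node_embs]
    mx = max(scores)
    ws = [math.exp(s - mx) for s in scores]   # numerically stable softmax
    z = sum(ws)
    ws = [w / z for w in ws]
    dim = len(query)
    return [sum(w * e[k] for w, e in zip(ws, node_embs)) for k in range(dim)]

nodes, edges = to_triplet_graph(triplets)
print(edges)  # [(0, 1), (0, 2)]: triplet 0 shares 'man' with 1, 'horse' with 2

# Stand-in 2-D node embeddings (in practice a GNN would produce these).
node_embs = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
print(attention_pool(node_embs, query=[1.0, 0.0]))
```

The point of triplets-as-nodes is that message passing in the GNN then operates over relation-level units rather than individual objects, so the learned graph embedding aggregates whole facts ("man riding horse") instead of isolated entities.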
ISSN: 0957-4174
DOI: 10.1016/j.eswa.2025.127926