CKT-RCM: Clip-Based Knowledge Transfer and Relational Context Mining for Unbiased Panoptic Scene Graph Generation


Bibliographic Details
Published in: Proceedings of the ... IEEE International Conference on Acoustics, Speech and Signal Processing (2024), pp. 3570 - 3574
Main Authors: Liang, Nanhao; Liu, Yong; Sun, Wenfang; Xia, Yingwei; Wang, Fan
Format: Conference Proceeding
Language: English
Published: IEEE, 14.04.2024

Summary: Panoptic Scene Graph (PSG) generation aims to produce a scene graph representing pairwise relationships between objects within an image. Its use of pixel-wise segmentation masks and its inclusion of background regions in relationship inference have quickly made it a popular approach. However, it faces an intrinsic challenge: the trained relationship predictors are either of low value or of low quality due to the long-tail distribution of typical datasets. Inspired by how humans use prior knowledge to greatly simplify this problem, we introduce two novel designs: using a pre-trained vision-language model to correct the data skewness, and using a conditional prior distribution on contexts to further refine prediction quality. Specifically, the approach, named CKT-RCM, first exploits relation-associated visual features from the image encoder and constructs a relation classifier by extracting text embeddings for all relationships from the text encoder of the vision-language model. It also utilizes rich relational context from subject-object pairs to facilitate informative relation predictions via a cross-attention mechanism. We conduct comprehensive experiments on the OpenPSG dataset and achieve state-of-the-art performance.
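The two mechanisms the summary describes — a relation classifier built from the vision-language model's text embeddings, and cross-attention over relational context from subject-object pairs — can be illustrated with a minimal numpy sketch. This is not the paper's implementation: the feature dimensions, the single-head attention, the additive residual refinement, and the temperature value are all illustrative assumptions; only the general CLIP-style similarity scoring and the cross-attention pattern come from the summary.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize vectors to unit length along `axis`."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def clip_style_relation_logits(visual_feats, text_embeds, temperature=0.07):
    """Score each relation label by cosine similarity between a pair's
    relation-associated visual feature and the text embedding of the
    relation name (CLIP-style zero-shot classification; temperature
    is an illustrative assumption)."""
    v = l2_normalize(visual_feats)   # (N, D): one feature per subject-object pair
    t = l2_normalize(text_embeds)    # (R, D): one embedding per relation label
    return (v @ t.T) / temperature   # (N, R) relation logits

def cross_attention(query, keys, values):
    """Single-head scaled dot-product cross-attention: refine each pair's
    query with relational context features."""
    d = query.shape[-1]
    scores = query @ keys.T / np.sqrt(d)                        # (Nq, Nk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)              # softmax rows
    return weights @ values                                     # (Nq, D)

# Toy stand-ins for encoder outputs (random; a real system would use the
# vision-language model's image and text encoders).
rng = np.random.default_rng(0)
pair_feats = rng.standard_normal((4, 16))   # 4 subject-object pair features
ctx_feats  = rng.standard_normal((6, 16))   # 6 relational context features
rel_texts  = rng.standard_normal((5, 16))   # text embeddings for 5 relations

# Refine pair features with context, then classify against text embeddings.
refined = pair_feats + cross_attention(pair_feats, ctx_feats, ctx_feats)
logits  = clip_style_relation_logits(refined, rel_texts)
print(logits.shape)  # (4, 5)
```

Because the classifier weights are text embeddings rather than learned per-class parameters, rare (long-tail) relations still get a meaningful decision boundary from language priors, which is the intuition behind using the pre-trained vision-language model to correct data skewness.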
ISSN:2379-190X
DOI:10.1109/ICASSP48485.2024.10446810