Grounded situation recognition under data scarcity
Published in | Scientific Reports, Vol. 14, No. 1, Article 25195 (16 pages) |
---|---|
Main Authors | , , , , , |
Format | Journal Article |
Language | English |
Published | London: Nature Publishing Group UK, 24.10.2024 |
Summary: | Grounded Situation Recognition (GSR) aims to generate structured image descriptions. For a given image, GSR must identify the key verb, the nouns filling the verb's semantic roles, and their bounding-box groundings. However, current GSR research demands large numbers of meticulously labeled images; such annotation is labor-intensive and time-consuming, making it costly to expand detection categories. Our study improves detection and localization accuracy under data scarcity, reducing dependence on large datasets and paving the way for broader detection capabilities. In this paper, we propose the Grounded Situation Recognition under Data Scarcity (GSRDS) model, which uses CoFormer as its baseline and optimizes three subtasks, image feature extraction, verb classification, and bounding-box localization, to better suit data-scarce scenarios. Specifically, we replace ResNet50 with EfficientNetV2-M for stronger image feature extraction. Additionally, we introduce the Transformer Combined with CLIP for Verb Classification (TCCV) module, which exploits features extracted by CLIP's image encoder to improve verb classification accuracy. Furthermore, we design the Multi-source Verb-Role Queries (Multi-VR Queries) and Dual Parallel Decoders (DPD) modules to improve the accuracy of bounding-box localization. Extensive comparative experiments and ablation studies show that our method achieves higher accuracy than mainstream approaches in data-scarce scenarios. Our code will be available at https://github.com/Zhou-maker-oss/GSRDS. |
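The abstract's TCCV idea, fusing features from a frozen CLIP-style image encoder with the detector's own image features before verb classification, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the layer sizes, the single-layer transformer fusion, and the `TCCVSketch` name are assumptions; the CLIP encoder is stood in for by a plain feature vector, and 504 is the verb count of the SWiG benchmark commonly used for GSR.

```python
import torch
import torch.nn as nn

class TCCVSketch(nn.Module):
    """Hedged sketch of TCCV-style verb classification: project the
    detector's pooled image features and a (frozen) CLIP image-encoder
    embedding into a shared space, fuse them with a small transformer
    encoder, and classify the verb. All dimensions are illustrative."""

    def __init__(self, backbone_dim=512, clip_dim=512, hidden=256, num_verbs=504):
        super().__init__()
        self.proj_backbone = nn.Linear(backbone_dim, hidden)
        self.proj_clip = nn.Linear(clip_dim, hidden)
        fusion_layer = nn.TransformerEncoderLayer(
            d_model=hidden, nhead=4, batch_first=True)
        self.fusion = nn.TransformerEncoder(fusion_layer, num_layers=1)
        self.verb_head = nn.Linear(hidden, num_verbs)

    def forward(self, backbone_feat, clip_feat):
        # backbone_feat: (B, backbone_dim) pooled detector features
        # clip_feat:     (B, clip_dim) embedding from a frozen CLIP image encoder
        tokens = torch.stack(
            [self.proj_backbone(backbone_feat), self.proj_clip(clip_feat)],
            dim=1)                                  # (B, 2, hidden)
        fused = self.fusion(tokens).mean(dim=1)     # (B, hidden)
        return self.verb_head(fused)                # (B, num_verbs) verb logits

model = TCCVSketch()
logits = model(torch.randn(2, 512), torch.randn(2, 512))
print(logits.shape)  # torch.Size([2, 504])
```

In practice the CLIP branch would be kept frozen so its pretrained visual semantics survive fine-tuning on a small dataset, which is the motivation the abstract gives for leaning on CLIP under data scarcity.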
Bibliography: | ObjectType-Article-1; SourceType-Scholarly Journals-1; ObjectType-Feature-2 |
ISSN: | 2045-2322 |
DOI: | 10.1038/s41598-024-75823-1 |