Improving language-supervised object detection with linguistic structure analysis

Bibliographic Details
Published in: 2023 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 5560-5570
Main Authors: Rai, Arushi; Kovashka, Adriana
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2023
Summary: Language-supervised object detection typically uses descriptive captions from human-annotated datasets. However, in-the-wild captions span a wider range of language styles. We analyze one particularly ubiquitous form of language: narrative. We study the differences in linguistic structure and visual-text alignment between narrative and descriptive captions and find that we can classify descriptive- and narrative-style captions using linguistic features such as part of speech, rhetorical structure theory, and multimodal discourse. We then use this classification to select captions from which to extract image-level labels as supervision for weakly supervised object detection. We also improve the quality of the extracted labels by filtering based on proximity to verb types, for both descriptive and narrative captions.
ISSN: 2160-7516
DOI: 10.1109/CVPRW59228.2023.00588
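
To make the pipeline described in the Summary concrete, below is a minimal, illustrative sketch, not the authors' implementation: it classifies captions as descriptive or narrative from part-of-speech frequencies alone, and it keeps only noun labels that occur near a verb, a crude stand-in for the paper's verb-type proximity filtering. The choice of spaCy and scikit-learn, the toy training captions, the detector vocabulary, and the distance threshold are all assumptions for illustration.

    import spacy
    from sklearn.feature_extraction import DictVectorizer
    from sklearn.linear_model import LogisticRegression

    # Assumed libraries; the paper's full feature set also includes rhetorical
    # structure theory and multimodal discourse features, which are omitted here.
    nlp = spacy.load("en_core_web_sm")

    def pos_features(caption):
        """Relative frequency of each coarse part-of-speech tag in the caption."""
        doc = nlp(caption)
        counts = {}
        for token in doc:
            counts[token.pos_] = counts.get(token.pos_, 0) + 1
        return {pos: count / max(len(doc), 1) for pos, count in counts.items()}

    # Hypothetical training captions: 0 = descriptive, 1 = narrative.
    captions = [
        "A brown dog sitting on a red couch.",
        "We walked the dog to the park before it started to rain.",
    ]
    styles = [0, 1]

    vectorizer = DictVectorizer()
    X = vectorizer.fit_transform([pos_features(c) for c in captions])
    style_classifier = LogisticRegression().fit(X, styles)

    def extract_image_labels(caption, detector_vocabulary, max_verb_distance=3):
        """Keep nouns from the detector vocabulary only when a verb appears
        within a few tokens (a stand-in for verb-type proximity filtering)."""
        doc = nlp(caption)
        kept = []
        for i, token in enumerate(doc):
            if token.pos_ != "NOUN" or token.lemma_ not in detector_vocabulary:
                continue
            start = max(0, i - max_verb_distance)
            end = min(len(doc), i + max_verb_distance + 1)
            if any(doc[j].pos_ == "VERB" for j in range(start, end)):
                kept.append(token.lemma_)
        return kept

    # Usage: predict the caption style, then extract filtered image-level labels.
    new_caption = "We walked the dog to the park before it started to rain."
    style = style_classifier.predict(vectorizer.transform([pos_features(new_caption)]))[0]
    labels = extract_image_labels(new_caption, {"dog", "park"})
    print(style, labels)

In this sketch the predicted style would decide whether a caption is used at all, and the proximity filter then prunes the image-level labels passed to a weakly supervised detector; the paper additionally distinguishes verb types rather than treating all verbs alike.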