Cross-modality interaction for few-shot multispectral object detection with semantic knowledge

Bibliographic Details
Published in: Neural Networks, Vol. 173, p. 106156
Main Authors: Huang, Lian; Peng, Zongju; Chen, Fen; Dai, Shaosheng; He, Ziqiang; Liu, Kesheng
Format: Journal Article
Language: English
Published: Elsevier Ltd, United States, 01.05.2024
Summary: Multispectral object detection (MOD), which incorporates additional information from thermal images into object detection (OD) to cope robustly with complex illumination conditions, has garnered significant attention. However, existing MOD methods always demand a considerable amount of annotated data for training. Inspired by the concept of few-shot learning, we propose a novel task called few-shot multispectral object detection (FSMOD), which aims to accomplish MOD using only a few annotated samples from each category. Specifically, we first design a cross-modality interaction (CMI) module, which leverages different attention mechanisms to exchange information between the visible and thermal modalities during backbone feature extraction. Guided by this interaction process, the detector extracts modality-specific backbone features with better discrimination. To improve the few-shot learning ability of the detector, we also design a semantic prototype metric (SPM) loss that integrates semantic knowledge, i.e., word embeddings, into the optimization of the embedding space. Semantic knowledge provides stable category representations when visual information is insufficient. Extensive experiments on the customized FSMOD dataset demonstrate that the proposed method achieves state-of-the-art performance.

Highlights:
• A task involving multispectral object detection within a few-shot setting.
• Cross-modality information effectively enhances the expression of backbone features.
• Semantic knowledge provides stable category representations.
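
The summary names two components, a CMI module built on attention and an SPM loss anchored by word embeddings, but gives no implementation details. The PyTorch sketch below is only a minimal illustration of the CMI idea under an assumed design: a generic channel-attention exchange in which each modality's attention weights modulate the other modality's backbone features. The class name CrossModalityInteraction and its internals are hypothetical, not the authors' actual module.

import torch
import torch.nn as nn

class CrossModalityInteraction(nn.Module):
    """Sketch: exchange channel-attention cues between visible and thermal features."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        # One small bottleneck per modality produces attention weights for the other modality.
        self.rgb_to_thermal = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        self.thermal_to_rgb = nn.Sequential(
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, feat_rgb, feat_thermal):
        # Attention derived from one modality reweights the other, with a residual path,
        # so each branch keeps modality-specific features while using cross-modal cues.
        att_from_rgb = self.rgb_to_thermal(self.pool(feat_rgb))
        att_from_thermal = self.thermal_to_rgb(self.pool(feat_thermal))
        feat_rgb = feat_rgb * att_from_thermal + feat_rgb
        feat_thermal = feat_thermal * att_from_rgb + feat_thermal
        return feat_rgb, feat_thermal

For the SPM loss, the abstract only says that word embeddings serve as stable category representations during embedding-space optimization. A common way to realize this, assumed here purely for illustration, is to project fixed word vectors (e.g., GloVe) into the visual feature space and use them as class prototypes in a cosine-similarity softmax; the function name, the temperature value, and the linear projection proj are all assumptions.

import torch
import torch.nn.functional as F

def semantic_prototype_metric_loss(features, labels, word_embeddings, proj, temperature=0.1):
    """Sketch of an SPM-style loss: projected word embeddings act as class prototypes."""
    # word_embeddings: (num_classes, d_word); proj (e.g., nn.Linear) maps them to the visual space.
    prototypes = F.normalize(proj(word_embeddings), dim=-1)   # (C, d_vis)
    features = F.normalize(features, dim=-1)                   # (N, d_vis) pooled RoI/query features
    logits = features @ prototypes.t() / temperature           # cosine similarities as logits
    return F.cross_entropy(logits, labels)                     # pull features toward their class prototype

In this reading, the word-embedding prototypes stay fixed while only the projection and the detector are trained, which is what would make the category representation stable when only a few visual examples are available.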
ISSN: 0893-6080, 1879-2782
DOI: 10.1016/j.neunet.2024.106156