Cross-modality interaction for few-shot multispectral object detection with semantic knowledge
Published in: Neural Networks, Vol. 173, p. 106156
Format: Journal Article
Language: English
Published: Elsevier Ltd, United States, 01.05.2024
Summary: Multispectral object detection (MOD), which incorporates additional information from thermal images into object detection (OD) to cope robustly with complex illumination conditions, has garnered significant attention. However, existing MOD methods typically demand a considerable amount of annotated data for training. Inspired by the concept of few-shot learning, we propose a novel task called few-shot multispectral object detection (FSMOD), which aims to accomplish MOD using only a few annotated samples per category. Specifically, we first design a cross-modality interaction (CMI) module that leverages different attention mechanisms to exchange information between the visible and thermal modalities during backbone feature extraction. Guided by this interaction process, the detector extracts modality-specific backbone features with better discrimination. To improve the few-shot learning ability of the detector, we also design a semantic prototype metric (SPM) loss that integrates semantic knowledge, i.e., word embeddings, into the optimization of the embedding space. Semantic knowledge provides stable category representations when visual information is insufficient. Extensive experiments on the customized FSMOD dataset demonstrate that the proposed method achieves state-of-the-art performance.
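The abstract only names the CMI module, so the following is a minimal PyTorch sketch of what attention-based interaction between visible and thermal backbone features could look like; the class name, bidirectional cross-attention design, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a cross-modality interaction (CMI) style block:
# each modality attends to the other and folds the result back in as a
# residual, keeping the two backbone streams modality-specific.
import torch
import torch.nn as nn

class CrossModalityInteraction(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.rgb_from_thermal = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.thermal_from_rgb = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(channels)
        self.norm_thermal = nn.LayerNorm(channels)

    def forward(self, f_rgb: torch.Tensor, f_th: torch.Tensor):
        # Flatten spatial maps (B, C, H, W) into token sequences (B, H*W, C).
        b, c, h, w = f_rgb.shape
        rgb = f_rgb.flatten(2).transpose(1, 2)
        th = f_th.flatten(2).transpose(1, 2)
        # Each modality queries the other, then adds the attended context
        # back as a residual connection.
        rgb_ctx, _ = self.rgb_from_thermal(rgb, th, th)
        th_ctx, _ = self.thermal_from_rgb(th, rgb, rgb)
        rgb = self.norm_rgb(rgb + rgb_ctx)
        th = self.norm_thermal(th + th_ctx)
        # Restore the (B, C, H, W) layout for the rest of the backbone.
        return (rgb.transpose(1, 2).reshape(b, c, h, w),
                th.transpose(1, 2).reshape(b, c, h, w))
```

Such a block would be inserted between backbone stages so that interaction guides feature extraction rather than only fusing final features.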
Highlights:
• A task involving multispectral object detection in a few-shot setting.
• Cross-modality information effectively enhances the expression of backbone features.
• Semantic knowledge provides stable category representations.
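Likewise, a hedged sketch of how an SPM-style loss might use word embeddings as stable class prototypes; the linear projection, cosine-similarity logits, and temperature are assumptions chosen to match the abstract's description, not the paper's exact formulation.

```python
# Hypothetical semantic prototype metric (SPM) style loss: frozen word
# embeddings act as category prototypes, and visual embeddings are pulled
# toward the prototype of their class.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticPrototypeMetricLoss(nn.Module):
    def __init__(self, word_embeddings: torch.Tensor, visual_dim: int,
                 temperature: float = 0.1):
        super().__init__()
        # (num_classes, word_dim) prototypes, frozen and unit-normalized;
        # they stay stable even when visual samples are scarce.
        self.register_buffer("prototypes", F.normalize(word_embeddings, dim=-1))
        # Assumed projection from the visual space to the word-embedding space.
        self.proj = nn.Linear(visual_dim, word_embeddings.shape[1])
        self.temperature = temperature

    def forward(self, visual_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between projected visual features and every
        # prototype serves as the classification logits.
        z = F.normalize(self.proj(visual_feats), dim=-1)
        logits = z @ self.prototypes.t() / self.temperature
        return F.cross_entropy(logits, labels)
```

In a detector this would plausibly be applied to region-of-interest embeddings alongside the standard detection losses, so the embedding space is shaped by semantic knowledge during few-shot training.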
ISSN: 0893-6080 (print), 1879-2782 (electronic)
DOI: 10.1016/j.neunet.2024.106156