Cross-modality interaction for few-shot multispectral object detection with semantic knowledge
Published in: Neural Networks, Vol. 173, p. 106156
Format: Journal Article
Language: English
Published: Elsevier Ltd, United States, 01.05.2024
Summary: Multispectral object detection (MOD), which incorporates additional information from thermal images into object detection (OD) to cope robustly with complex illumination conditions, has garnered significant attention. However, existing MOD methods typically demand a considerable amount of annotated data for training. Inspired by the concept of few-shot learning, we propose a novel task called few-shot multispectral object detection (FSMOD), which aims to accomplish MOD using only a few annotated samples per category. Specifically, we first design a cross-modality interaction (CMI) module that leverages different attention mechanisms to exchange information between the visible and thermal modalities during backbone feature extraction. Guided by this interaction process, the detector extracts modality-specific backbone features with better discrimination. To improve the few-shot learning ability of the detector, we also design a semantic prototype metric (SPM) loss that integrates semantic knowledge, i.e., word embeddings, into the optimization of the embedding space. Semantic knowledge provides stable category representations when visual information is insufficient. Extensive experiments on the customized FSMOD dataset demonstrate that the proposed method achieves state-of-the-art performance.
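The abstract only names the CMI module, so the following is a minimal PyTorch sketch of what attention-based interaction between visible and thermal backbone features could look like; the class name, bidirectional cross-attention design, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a cross-modality interaction (CMI) style block:
# each modality attends to the other and folds the result back in as a
# residual, keeping the two backbone streams modality-specific.
import torch
import torch.nn as nn

class CrossModalityInteraction(nn.Module):
    def __init__(self, channels: int, num_heads: int = 4):
        super().__init__()
        self.rgb_from_thermal = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.thermal_from_rgb = nn.MultiheadAttention(channels, num_heads, batch_first=True)
        self.norm_rgb = nn.LayerNorm(channels)
        self.norm_thermal = nn.LayerNorm(channels)

    def forward(self, f_rgb: torch.Tensor, f_th: torch.Tensor):
        # Flatten spatial maps (B, C, H, W) into token sequences (B, H*W, C).
        b, c, h, w = f_rgb.shape
        rgb = f_rgb.flatten(2).transpose(1, 2)
        th = f_th.flatten(2).transpose(1, 2)
        # Each modality queries the other, then adds the attended context
        # back as a residual connection.
        rgb_ctx, _ = self.rgb_from_thermal(rgb, th, th)
        th_ctx, _ = self.thermal_from_rgb(th, rgb, rgb)
        rgb = self.norm_rgb(rgb + rgb_ctx)
        th = self.norm_thermal(th + th_ctx)
        # Restore the (B, C, H, W) layout for the rest of the backbone.
        return (rgb.transpose(1, 2).reshape(b, c, h, w),
                th.transpose(1, 2).reshape(b, c, h, w))
```

Such a block would be inserted between backbone stages so that interaction guides feature extraction rather than only fusing final features.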
Highlights:
• A task involving multispectral object detection in a few-shot setting.
• Cross-modality information effectively enhances the expression of backbone features.
• Semantic knowledge provides stable category representations.
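Likewise, a hedged sketch of how an SPM-style loss might use word embeddings as stable class prototypes; the linear projection, cosine-similarity logits, and temperature are assumptions chosen to match the abstract's description, not the paper's exact formulation.

```python
# Hypothetical semantic prototype metric (SPM) style loss: frozen word
# embeddings act as category prototypes, and visual embeddings are pulled
# toward the prototype of their class.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SemanticPrototypeMetricLoss(nn.Module):
    def __init__(self, word_embeddings: torch.Tensor, visual_dim: int,
                 temperature: float = 0.1):
        super().__init__()
        # (num_classes, word_dim) prototypes, frozen and unit-normalized;
        # they stay stable even when visual samples are scarce.
        self.register_buffer("prototypes", F.normalize(word_embeddings, dim=-1))
        # Assumed projection from the visual space to the word-embedding space.
        self.proj = nn.Linear(visual_dim, word_embeddings.shape[1])
        self.temperature = temperature

    def forward(self, visual_feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # Cosine similarity between projected visual features and every
        # prototype serves as the classification logits.
        z = F.normalize(self.proj(visual_feats), dim=-1)
        logits = z @ self.prototypes.t() / self.temperature
        return F.cross_entropy(logits, labels)
```

In a detector this would plausibly be applied to region-of-interest embeddings alongside the standard detection losses, so the embedding space is shaped by semantic knowledge during few-shot training.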
ISSN: 0893-6080 (print), 1879-2782 (electronic)
DOI: 10.1016/j.neunet.2024.106156