Learning a multimodal feature transformer for RGBT tracking

Bibliographic Details
Published in: Signal, image and video processing, Vol. 18, no. Suppl 1, pp. 239-250
Main Authors: Shi, Huiwei; Mu, Xiaodong; Shen, Danyao; Zhong, Chengliang
Format: Journal Article
Language: English
Published: London: Springer London, 2024; Springer Nature B.V
Summary: RGB-thermal (RGBT) tracking aims to achieve reliable visual tracking in challenging environments characterized by drastic illumination changes, adverse weather conditions, and background clutter. By exploiting complementary multimodal information, it enables robust tracking in all-day and all-weather scenarios. Despite significant progress in this field, some existing dual-stream RGBT object tracking methods tend to suppress low-quality or low-contribution modal features during the fusion phase, which limits further improvements in tracking performance. To address this limitation, this paper proposes a novel dual-stream hierarchical transformer fusion network with an enhanced capacity to use local and global discriminative information from both the RGB and thermal modalities. The approach incorporates a multimodal feature transformer encoder enriched with modulation layers that adaptively extract modality-specific features. This adaptive fusion process combines both low-quality and high-quality modal information, enhancing the network's ability to represent the modal features in both the RGB and thermal branches. Additionally, dynamic anchor boxes and denoising-based training are used to accelerate training of the dual-stream transformer. Comprehensive experiments on RGBT datasets show that the proposed method outperforms state-of-the-art tracking methods in challenging tracking scenarios.
ISSN: 1863-1703, 1863-1711
DOI: 10.1007/s11760-024-03148-7