CMC2R: Cross‐modal collaborative contextual representation for RGBT tracking

Bibliographic Details
Published in: IET Image Processing, Vol. 16, No. 5, pp. 1500-1510
Main Authors: Liu, Xiaohu; Luo, Yichuang; Yan, Keding; Chen, Jianfei; Lei, Zhiyong
Format: Journal Article
Language: English
Published: Wiley, 01.04.2022

Summary: The key challenge in RGBT tracking is how to fuse dual-modality information to build a robust RGB-T tracker. Motivated by the CNN's strength at extracting local features and the vision transformer's strength at building global representations, the authors propose a two-stream hybrid structure, termed CMC2R, that exploits both convolutional operations and self-attention mechanisms to learn an enhanced representation. CMC2R fuses local features and global representations at different resolutions through the transformer layer of the encoder block, and the two modalities collaborate to obtain contextual information via spatial and channel self-attention. Temporal association is performed with track queries: each track query models the entire track of one object and is updated frame by frame to build long-range temporal relations. Experimental results show the effectiveness of the proposed method, which achieves state-of-the-art performance.
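
As a rough illustration of the spatial and channel self-attention fusion described in the summary, the following PyTorch sketch fuses RGB and thermal feature tokens. The module name SpatialChannelFusion, the squeeze-style channel gate, and all shapes are illustrative assumptions, not the paper's actual implementation.

import torch
import torch.nn as nn

class SpatialChannelFusion(nn.Module):
    """Illustrative fusion of RGB and thermal tokens (not the authors' code):
    joint spatial self-attention across both modalities, followed by a
    channel gate that reweights the feature dimension."""
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        # Spatial self-attention: tokens of both modalities attend jointly,
        # so each location can draw context from either modality.
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Channel attention: a lightweight gate over the feature dimension
        # (squeeze-and-excitation style; an assumption, not from the paper).
        self.channel_gate = nn.Sequential(
            nn.Linear(dim, dim // 4),
            nn.ReLU(inplace=True),
            nn.Linear(dim // 4, dim),
            nn.Sigmoid(),
        )

    def forward(self, rgb: torch.Tensor, thermal: torch.Tensor) -> torch.Tensor:
        # rgb, thermal: (batch, tokens, dim) flattened feature maps
        x = torch.cat([rgb, thermal], dim=1)      # joint cross-modal sequence
        h = self.norm(x)
        attn, _ = self.spatial_attn(h, h, h)      # spatial (token-wise) context
        x = x + attn                              # residual connection
        gate = self.channel_gate(x.mean(dim=1))   # (batch, dim) channel weights
        x = x * gate.unsqueeze(1)                 # channel-wise reweighting
        n = rgb.shape[1]
        return 0.5 * (x[:, :n] + x[:, n:])        # merge the two modality halves

# Usage: fuse a pair of 16x16 feature maps (256 tokens, 128 channels).
fusion = SpatialChannelFusion(dim=128)
fused = fusion(torch.randn(2, 256, 128), torch.randn(2, 256, 128))  # (2, 256, 128)

Concatenating the two token sequences lets a single attention pass serve as both intra- and cross-modal spatial attention; the paper's actual encoder block and multi-resolution fusion are more involved.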
ISSN: 1751-9659, 1751-9667
DOI: 10.1049/ipr2.12427