Spatial-Channel Enhanced Transformer for Visible-Infrared Person Re-Identification

Bibliographic Details
Published in	IEEE Transactions on Multimedia, Vol. 25, pp. 3668-3680
Main Authors	Zhao, Jiaqi; Wang, Hanzheng; Zhou, Yong; Yao, Rui; Chen, Silin; Saddik, Abdulmotaleb El
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023
Subjects	Computer networks; Computer vision; Convolution; Cross-modality person re-identification; deep learning; Embedding; Feature extraction; image retrieval; Infrared imagery; Modules; Object detection; Representation learning; Representations; Task analysis; Transformers; Visual discrimination; visual Transformer; Visualization
Summary:	Visible-infrared person re-identification (VI-ReID) is a challenging computer vision task that aims to match people across images from the visible and infrared modalities. The widely used VI-ReID framework consists of a convolutional neural network backbone that extracts visual features and a feature embedding network that projects heterogeneous features into a common feature space. However, many studies built on existing pre-trained models neglect potential correlations between different locations and channels within a single sample during feature extraction. Inspired by the success of the Transformer in computer vision, we extend it to enhance feature representation for VI-ReID. In this paper, we propose a discriminative feature learning network based on a visual Transformer (DFLN-ViT) for VI-ReID. First, to capture long-range dependencies between different locations, we propose a spatial feature awareness module (SAM), which uses a single-layer Transformer with a novel patch-embedding strategy to encode location information. Second, to refine the representation at each channel, we design a channel feature enhancement module (CEM). The CEM treats the features of each channel as a sequence of Transformer inputs, taking advantage of the Transformer's ability to model long-range dependencies. Finally, we propose a Triplet-aided Hetero-Center (THC) loss to learn more discriminative feature representations by balancing the cross-modality and intra-modality distances of the centers. Experimental results on two datasets show that our method significantly improves VI-ReID performance, outperforming most state-of-the-art methods.
ISSN:	1520-9210; 1941-0077
DOI:	10.1109/TMM.2022.3163847
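
The summary distinguishes two ways of feeding a feature map to a Transformer: as a sequence of spatial locations (the idea behind SAM) and as a sequence of channels (the idea behind CEM). The sketch below illustrates only that difference in tokenization, in plain Python. It is not the paper's implementation: the function names are hypothetical, the feature map is a toy nested list, and the attention step uses identity query/key/value projections, whereas a real Transformer layer would use learned projections and the paper's own patch-embedding strategy, which the summary does not detail.

```python
import math


def spatial_tokens(fmap):
    """Turn a [C][H][W] feature map into H*W tokens, each of length C.

    Each token describes one location across all channels.
    """
    channels = len(fmap)
    height, width = len(fmap[0]), len(fmap[0][0])
    return [
        [fmap[c][h][w] for c in range(channels)]
        for h in range(height)
        for w in range(width)
    ]


def channel_tokens(fmap):
    """Turn a [C][H][W] feature map into C tokens, each of length H*W.

    Each token is one channel flattened in row-major order.
    """
    return [[value for row in channel for value in row] for channel in fmap]


def self_attention(tokens):
    """Scaled dot-product self-attention over a list of equal-length tokens.

    Simplifying assumption: queries, keys and values are the tokens
    themselves (no learned projections, single head).
    """
    dim = len(tokens[0])
    outputs = []
    for query in tokens:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in tokens
        ]
        # Numerically stable softmax over the scores.
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output is a weighted average of all tokens.
        outputs.append(
            [sum(w * tok[i] for w, tok in zip(weights, tokens)) for i in range(dim)]
        )
    return outputs


# Toy feature map with C=2 channels, H=2 rows, W=3 columns.
fmap = [
    [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]],
    [[0.5, 0.5, 0.5], [0.5, 0.5, 0.5]],
]

spatial = spatial_tokens(fmap)   # 6 tokens of length 2
channel = channel_tokens(fmap)   # 2 tokens of length 6

# Attention mixes locations in the first case and channels in the second.
spatial_out = self_attention(spatial)
channel_out = self_attention(channel)
```

The same attention routine serves both cases; only the axis chosen as the sequence dimension changes, which is why the two modules can capture location-wise and channel-wise correlations respectively.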