HAResformer: A Hybrid ResNet-Transformer Hierarchical Aggregation Architecture for Visible-Infrared Person Reidentification

Bibliographic Details
Published in: IEEE Internet of Things Journal, Vol. 12, No. 12, pp. 21691-21703
Main Authors: Qian, Yongheng; Tang, Su-Kit
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 15.06.2025
Summary: Modality differences and intramodality variations make the visible-infrared person reidentification (VI-ReID) task highly challenging. Most existing methods focus on building network frameworks based on convolutional neural networks (CNNs) or pure vision transformers (ViTs) to extract discriminative features and address these challenges. However, these methods neglect several key points: deeply fusing local features with global spatial information enhances comprehensive discriminative representation; patch tokens carry rich semantic information; and different feature extraction stages within the network emphasize different semantic elements. To address these issues, we propose a novel hybrid ResNet-Transformer hierarchical aggregation architecture named HAResformer. HAResformer comprises three key components: 1) a hierarchical feature extraction (HFE) framework; 2) deeply supervised aggregation (DSA); and 3) a hierarchical global aggregate encoder (HGAE). Specifically, HFE introduces a lightweight cross-encoder feature fusion module (CFFM) to deeply integrate the local features and global spatial information of a person extracted by the ResNet encoder (RE) and the Transformer encoder (TE). The fused features are then fed as global priors into the next-stage TE for deep interaction, aiming to extract specific local features and global contextual clues. Additionally, DSA and HGAE provide auxiliary supervision and aggregation over multiscale features to enhance multigranularity feature representation. HAResformer effectively alleviates modality differences and reduces intramodality variations. Extensive experiments on three benchmarks demonstrate the effectiveness and generalization of our architecture, which outperforms most state-of-the-art methods. HAResformer has the potential to become a new VI-ReID baseline, promoting high-quality research in the future.
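Since the summary describes the cross-encoder fusion only in prose, the following minimal PyTorch sketch illustrates the general idea of merging ResNet feature maps with transformer patch tokens. The class name CrossEncoderFusion, the channel and embedding sizes, and the simple concatenate-and-project fusion rule are all assumptions made for illustration; this is not the paper's CFFM implementation.

import torch
import torch.nn as nn

class CrossEncoderFusion(nn.Module):
    # Illustrative CFFM-style fusion: align CNN feature maps with
    # transformer patch tokens, then merge the two streams per token.
    def __init__(self, cnn_channels=256, embed_dim=256):
        super().__init__()
        self.proj = nn.Conv2d(cnn_channels, embed_dim, kernel_size=1)  # channel alignment
        self.fuse = nn.Linear(2 * embed_dim, embed_dim)                # concatenate-and-project
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, cnn_feat, tokens):
        # cnn_feat: (B, C, H, W) local features from the ResNet encoder (RE)
        # tokens:   (B, H*W, D) patch tokens from the Transformer encoder (TE)
        local = self.proj(cnn_feat).flatten(2).transpose(1, 2)  # -> (B, H*W, D)
        fused = self.fuse(torch.cat([local, tokens], dim=-1))   # per-token fusion
        return self.norm(fused)  # would serve as the global prior for the next-stage TE

# toy check: a 16x8 feature map fused with 128 matching patch tokens
module = CrossEncoderFusion()
out = module(torch.randn(2, 256, 16, 8), torch.randn(2, 128, 256))
print(out.shape)  # torch.Size([2, 128, 256])

In an actual pipeline, the fused tokens would be passed back into the next transformer stage, as the summary describes for HFE.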
ISSN: 2327-4662
DOI: 10.1109/JIOT.2025.3547920