Homogeneous and Heterogeneous Feature Learning Based on Large Models for Unsupervised Text-to-Image Person Re-Identification

Bibliographic Details
Published in	2025 IEEE 2nd International Conference on Deep Learning and Computer Vision (DLCV) pp. 1 - 5
Main Authors	Shao, Chenglong; Si, Tongzhen; Zhou, Jiehan; Yang, Xiaohui
Format	Conference Proceeding
Language	English
Published	IEEE 06.06.2025
Subjects	Adaptation models; Computational modeling; Identification of persons; large model; Pedestrians; Reliability engineering; Representation learning; Supervised learning; Text to image; Text-to-image person re-identification; Unsupervised learning
DOI	10.1109/DLCV65218.2025.11088688

Summary:	Text-to-image person re-identification (TIReID) aims to identify and retrieve target pedestrians according to given textual queries. Driven by large amounts of annotated data, existing supervised learning methods have achieved promising performance. However, manually annotating large-scale databases is extremely time-consuming and impractical, which restricts the application of these methods in practical scenarios. Several methods fine-tune a single multimodal large language model (MLLM) to construct cross-modality databases and employ contrastive loss to constrain sample features. However, they neglect the reliability of the text generation and feature optimization processes. To this end, we propose Homogeneous and Heterogeneous Feature Learning based on Large Models (HHFLLM) for the unsupervised TIReID task. Firstly, we design a text generation process with joint large models that leverages the diversity of MLLMs to generate and filter reliable texts for constructing image-text matching relationships. Secondly, we introduce an adapter-based learning strategy to transfer image-text prior knowledge and enhance the feature representation capability. Furthermore, we construct a Homogeneous and Heterogeneous Feature Learning (HHFL) process, which continuously optimizes the intra-modality and inter-modality features from class and instance views. We perform extensive experiments on benchmark TIReID databases to evaluate HHFLLM. The experimental results demonstrate that our method achieves state-of-the-art performance compared with existing unsupervised methods.