Improving embedding learning by virtual attribute decoupling for text-based person search


Bibliographic Details
Published in: Neural Computing & Applications, Vol. 34, No. 7, pp. 5625-5647
Main Authors: Wang, Chengji; Luo, Zhiming; Lin, Yaojin; Li, Shaozi
Format: Journal Article
Language: English
Published: London: Springer London (Springer Nature B.V.), 01.04.2022

Summary: This paper considers the problem of text-based person search, which aims to find a target person based on a textual query description. Previous methods commonly focus on learning shared image-text embeddings but largely ignore the effect of pedestrian attributes. Attributes are fine-grained information that provide mid-level semantics and have been shown to be effective in traditional image-based person search. In text-based person search, however, it is hard to incorporate attribute information when learning discriminative image-text embeddings, because (1) the same attribute can be described in various ways across different texts and (2) attribute-related information is hard to decouple without attribute annotations. In this paper, we propose an improving embedding learning by virtual attribute decoupling (iVAD) model for learning modality-invariant image-text embeddings. To the best of our knowledge, this is the first work to perform unsupervised attribute decoupling in the text-based person search task. In iVAD, we first propose a novel virtual attribute decoupling (VAD) module that uses an encoder-decoder embedding learning structure to decompose attribute information from images and texts: the pedestrian attributes are treated as a hidden vector, from which attribute-related embeddings are obtained. In addition, unlike previous works that separate attribute learning from image-text embedding learning, we propose a hierarchical feature embedding framework that incorporates the attribute-related embeddings into the learned image-text embeddings through an attribute-enhanced feature embedding (AEFE) module, which uses the attribute information to improve the discriminability of the learned features. Extensive evaluations demonstrate the superiority of our method over a wide variety of state-of-the-art methods on the CUHK-PEDES dataset, and results on the Caltech-UCSD Birds (CUB), Oxford-102 Flowers (Flowers) and Flickr30K datasets further verify its effectiveness. A visualization shows that the proposed iVAD model can effectively discover co-occurring pedestrian attributes in corresponding image-text pairs.
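The abstract describes an encoder-decoder step that decouples a latent "virtual attribute" code without annotations, followed by a fusion step that injects that code back into the global embedding. The PyTorch sketch below illustrates this general idea only: the module names echo the paper's VAD and AEFE terminology, but every layer size, the gated fusion, and the reconstruction loss are hypothetical choices, since the record does not include the actual architecture.

```python
# Minimal sketch of the decouple-then-enhance idea, under assumed dimensions.
import torch
import torch.nn as nn

class VirtualAttributeDecoupling(nn.Module):
    """Encoder-decoder that treats attributes as an unobserved latent vector.

    The encoder compresses a modality embedding (image or text) into a
    low-dimensional attribute code; the decoder reconstructs the input, so
    the code is pushed to retain attribute-level information without any
    attribute annotations (the "unsupervised decoupling" in the abstract).
    """
    def __init__(self, dim=2048, attr_dim=256):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(dim, 512), nn.ReLU(),
                                     nn.Linear(512, attr_dim))
        self.decoder = nn.Sequential(nn.Linear(attr_dim, 512), nn.ReLU(),
                                     nn.Linear(512, dim))

    def forward(self, feat):
        attr = self.encoder(feat)    # attribute-related embedding
        recon = self.decoder(attr)   # reconstruction for the decoupling loss
        return attr, recon

class AttributeEnhancedFusion(nn.Module):
    """Injects the attribute code back into the global embedding.

    A gated residual sum is used here as one plausible fusion; the paper's
    AEFE module may differ.
    """
    def __init__(self, dim=2048, attr_dim=256):
        super().__init__()
        self.proj = nn.Linear(attr_dim, dim)
        self.gate = nn.Sequential(nn.Linear(dim * 2, dim), nn.Sigmoid())

    def forward(self, feat, attr):
        a = self.proj(attr)
        g = self.gate(torch.cat([feat, a], dim=-1))
        return feat + g * a          # attribute-enhanced embedding

# Toy usage: the same pipeline is applied to both modalities' pooled features.
img_feat = torch.randn(8, 2048)      # e.g. pooled CNN image features
txt_feat = torch.randn(8, 2048)      # e.g. pooled text features
vad, aefe = VirtualAttributeDecoupling(), AttributeEnhancedFusion()
img_attr, img_recon = vad(img_feat)
txt_attr, txt_recon = vad(txt_feat)
img_emb = aefe(img_feat, img_attr)   # embeddings used for cross-modal matching
txt_emb = aefe(txt_feat, txt_attr)
recon_loss = (nn.functional.mse_loss(img_recon, img_feat)
              + nn.functional.mse_loss(txt_recon, txt_feat))
```

In a full model these enhanced embeddings would additionally be trained with a cross-modal matching objective so that co-occurring attributes align across image-text pairs, as the abstract's visualization suggests.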
ISSN: 0941-0643
EISSN: 1433-3058
DOI: 10.1007/s00521-021-06734-9