Dense Interaction Learning for Video-based Person Re-identification

Bibliographic Details
Published in	2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 1470-1481
Main Authors	He, Tianyu; Jin, Xin; Shen, Xu; Huang, Jianqiang; Chen, Zhibo; Hua, Xian-Sheng
Format	Conference Proceeding
Language	English
Published	IEEE 01.10.2021
Subjects	Architecture; Buildings; Computer architecture; Computer vision; Decoding; Feature extraction; Image and video retrieval; Task analysis; Video analysis and understanding; Vision applications and systems
Summary:	Video-based person re-identification (re-ID) aims at matching the same person across video clips. Efficiently exploiting multi-scale fine-grained features while building the structural interaction among them is pivotal for its success. In this paper, we propose a hybrid framework, Dense Interaction Learning (DenseIL), that combines the principal advantages of CNN-based and attention-based architectures to tackle the difficulties of video-based person re-ID. DenseIL contains a CNN encoder and a Dense Interaction (DI) decoder. The CNN encoder efficiently extracts discriminative spatial features, while the DI decoder densely models the inherent spatial-temporal interaction across frames. Unlike previous work, we additionally let the DI decoder densely attend to intermediate fine-grained CNN features, which naturally yields a multi-grained spatial-temporal representation for each video clip. Moreover, we introduce a Spatio-TEmporal Positional Embedding (STEP-Emb) into the DI decoder to investigate the positional relation among the spatial-temporal inputs. In our experiments, DenseIL consistently and significantly outperforms all state-of-the-art methods on multiple standard video-based person re-ID datasets.
ISSN:	2380-7504
DOI:	10.1109/ICCV48922.2021.00152
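
Illustrative sketch:	The summary describes a decoder that attends to CNN features from several stages and adds a spatio-temporal positional embedding. The NumPy sketch below is only a rough illustration of that idea and is not the authors' implementation. Everything in it is an assumption made for the example: the function names, the random stand-in feature maps, the shared channel width across stages, the choice to build the positional embedding as a sum of a per-frame and a per-position vector, and the use of plain single-head scaled dot-product attention with a residual connection.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def st_positional_embedding(num_frames, num_positions, dim, rng):
    # Hypothetical embedding: one vector per frame index plus one vector per
    # spatial position, summed by broadcasting. The record does not say how
    # STEP-Emb is actually constructed.
    temporal = rng.normal(scale=0.02, size=(num_frames, 1, dim))
    spatial = rng.normal(scale=0.02, size=(1, num_positions, dim))
    return (temporal + spatial).reshape(num_frames * num_positions, dim)


def cross_attention(queries, memory):
    # Generic single-head scaled dot-product attention.
    # queries: (Nq, d), memory: (Nm, d) -> output (Nq, d), weights (Nq, Nm)
    dim = queries.shape[-1]
    weights = softmax(queries @ memory.T / np.sqrt(dim), axis=-1)
    return weights @ memory, weights


def dense_interaction_sketch(stage_features, rng):
    # stage_features: list of arrays shaped (frames, positions, dim), one per
    # CNN stage; the last entry plays the role of the final-stage features.
    frames, positions, dim = stage_features[-1].shape
    tokens = stage_features[-1].reshape(frames * positions, dim)
    tokens = tokens + st_positional_embedding(frames, positions, dim, rng)
    for feats in stage_features:
        f, p, _ = feats.shape
        memory = feats.reshape(f * p, dim)
        memory = memory + st_positional_embedding(f, p, dim, rng)
        attended, _ = cross_attention(tokens, memory)
        tokens = tokens + attended  # residual connection (an assumption)
    return tokens.mean(axis=0)  # pool all tokens into one clip-level vector


rng = np.random.default_rng(0)
# Random stand-ins for three CNN stages of a 4-frame clip, 32 channels each.
clip_features = [
    rng.normal(size=(4, 64, 32)),
    rng.normal(size=(4, 16, 32)),
    rng.normal(size=(4, 4, 32)),
]
clip_embedding = dense_interaction_sketch(clip_features, rng)
print(clip_embedding.shape)  # (32,)
```

Each stage contributes tokens at a different spatial granularity, so the pooled vector mixes coarse and fine features across all frames. A real model would use learned projections and embeddings in place of the random arrays used here.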