Motion Stimulation for Compositional Action Recognition

Bibliographic Details
Published in IEEE Transactions on Circuits and Systems for Video Technology, Vol. 33, no. 5, pp. 2061-2074
Main Authors Ma, Lei, Zheng, Yuhui, Zhang, Zhao, Yao, Yazhou, Fan, Xijian, Ye, Qiaolin
Format Journal Article
Language English
Published New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2023
Subjects

More Information
Summary: Recognizing unseen combinations of actions and objects, namely (zero-shot) compositional action recognition, is extremely challenging for conventional action recognition algorithms in real-world applications. Previous methods focus on enhancing the dynamic clues of objects that appear in the scene by building region features or tracklet embeddings from ground-truth annotations or detected bounding boxes. These methods rely heavily on manual annotation or on the quality of detectors, which makes them inflexible for practical applications. In this work, we aim to mine temporal clues from moving objects or hands without explicit supervision. To this end, we propose a novel Motion Stimulation (MS) block, specifically designed to mine dynamic clues of local regions autonomously from adjacent frames. The MS block consists of three steps: motion feature extraction, motion feature recalibration, and action-centric excitation. It can be directly and conveniently integrated into existing video backbones to enhance the compositional generalization ability of action recognition algorithms. Extensive experimental results on three action recognition datasets, Something-Else, IKEA-Assembly, and EPIC-KITCHENS, indicate the effectiveness and interpretability of our MS block.
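The three steps named in the summary can be sketched in NumPy as below. This is a hypothetical illustration only: the paper's exact operators are not reproduced here, so simple frame differencing, channel-wise pooling, and sigmoid gating stand in for the actual extraction, recalibration, and excitation stages.

```python
import numpy as np


def motion_stimulation(features):
    """Hypothetical sketch of a Motion Stimulation (MS)-style block.

    features: array of shape (T, C, H, W) -- per-frame feature maps.
    Stages follow the three steps named in the abstract, with simple
    stand-in operators (the paper's actual design may differ).
    """
    T, C, H, W = features.shape

    # 1. Motion feature extraction: differences between adjacent frames
    #    approximate local dynamic clues (last frame padded with zeros).
    motion = np.zeros_like(features)
    motion[:-1] = features[1:] - features[:-1]

    # 2. Motion feature recalibration: reweight channels by their globally
    #    pooled motion response (a squeeze-and-excitation-style stand-in).
    pooled = np.abs(motion).mean(axis=(0, 2, 3))          # (C,)
    weights = pooled / (pooled.sum() + 1e-8) * C          # normalized per channel
    recalibrated = motion * weights[None, :, None, None]

    # 3. Action-centric excitation: gate the original features with a
    #    sigmoid of the recalibrated motion map, emphasizing moving regions.
    gate = 1.0 / (1.0 + np.exp(-recalibrated))
    return features * gate


# Example: 8 frames, 16 channels, 7x7 spatial grid.
x = np.random.randn(8, 16, 7, 7).astype(np.float32)
y = motion_stimulation(x)
print(y.shape)  # (8, 16, 7, 7)
```

Because the block is purely element-wise on top of the backbone's feature maps, its output keeps the input shape, which is what lets such a module be dropped into an existing video backbone without architectural changes.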
ISSN:1051-8215
1558-2205
DOI:10.1109/TCSVT.2022.3222305