FEXNet: Foreground Extraction Network for Human Action Recognition

Bibliographic Details
Published in	IEEE Transactions on Circuits and Systems for Video Technology, Vol. 32, no. 5, pp. 3141-3151
Main Authors	Shen, Zhongwei; Wu, Xiao-Jun; Xu, Tianyang
Format	Journal Article
Language	English
Published	New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.05.2022
Subjects	action recognition; Convolutional neural networks; Feature extraction; Feature maps; Foreground-related features; Human activity recognition; Human motion; Image recognition; Iron; Modelling; Modules; Solid modeling; spatiotemporal modeling; Spatiotemporal phenomena; Three-dimensional displays; Two dimensional models
Summary:	Most human actions in video sequences consist of continuous interactions between foreground subjects rather than the background scene, so disentangling the foreground from the background is important for advanced action recognition systems. In this paper, therefore, we propose a Foreground EXtraction (FEX) block that explicitly models foreground cues so that action subjects are handled effectively. The FEX block contains two components. The first is a Foreground Enhancement (FE) module, which highlights the feature channels potentially related to action attributes, providing channel-level refinement for the subsequent spatiotemporal modeling. The second is a Scene Segregation (SS) module, which splits the feature maps into foreground and background. A temporal model with dynamic enhancement is constructed for the foreground part, reflecting the essential nature of the action category, while the background is modeled using simple spatial convolutions that map the inputs to a consistent feature space. FEX blocks can be inserted into existing 2D CNNs (the resulting networks are denoted FEXNet) for spatiotemporal modeling that concentrates on foreground cues for effective action inference. Experiments on Something-Something V1 and V2 and on Kinetics400 verify the effectiveness of the proposed method.
ISSN:	1051-8215, 1558-2205
DOI:	10.1109/TCSVT.2021.3103677
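
The data flow of the FEX block described in the summary can be illustrated with a small sketch. This is not the authors' implementation. The summary does not say how the gates are computed, how the foreground/background split is made, or what "dynamic enhancement" involves, so everything concrete below is an assumption made for illustration: a sigmoid gate on globally averaged activations stands in for the FE module, a channel-wise split (the hypothetical `fg_channels` parameter) stands in for the SS module, a fixed 3-tap temporal filter stands in for the foreground temporal model, and a 3x3 box filter stands in for the background spatial convolution. Dynamic enhancement is not modeled. A clip is a nested list indexed as `clip[t][c][y][x]`, and all function names are hypothetical.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def channel_gates(clip):
    """FE-style step (assumed form): one gate per channel, computed from
    the mean activation over all frames and spatial positions."""
    gates = []
    for c in range(len(clip[0])):
        values = [v for frame in clip for row in frame[c] for v in row]
        gates.append(sigmoid(sum(values) / len(values)))
    return gates

def enhance(clip, gates):
    """Scale every channel by its gate (channel-level refinement)."""
    return [[[[v * gates[c] for v in row] for row in chan]
             for c, chan in enumerate(frame)] for frame in clip]

def temporal_filter(clip, kernel=(0.25, 0.5, 0.25)):
    """Foreground branch (assumed form): fixed 3-tap filter along the
    time axis with zero padding, applied at every spatial position."""
    T = len(clip)
    out = []
    for t in range(T):
        frame = []
        for c, chan in enumerate(clip[t]):
            new_chan = []
            for y, row in enumerate(chan):
                new_row = []
                for x in range(len(row)):
                    acc = 0.0
                    for k, w in enumerate(kernel):
                        s = t + k - 1  # neighbouring frame index
                        if 0 <= s < T:
                            acc += w * clip[s][c][y][x]
                    new_row.append(acc)
                new_chan.append(new_row)
            frame.append(new_chan)
        out.append(frame)
    return out

def spatial_filter(clip):
    """Background branch (assumed form): 3x3 box filter with zero
    padding, applied to each frame independently."""
    out = []
    for frame in clip:
        new_frame = []
        for chan in frame:
            H, W = len(chan), len(chan[0])
            new_frame.append([
                [sum(chan[y + dy][x + dx]
                     for dy in (-1, 0, 1) for dx in (-1, 0, 1)
                     if 0 <= y + dy < H and 0 <= x + dx < W) / 9.0
                 for x in range(W)]
                for y in range(H)])
        out.append(new_frame)
    return out

def fex_block(clip, fg_channels):
    """Enhance channels, split them into foreground and background
    groups, process each group separately, then concatenate."""
    enhanced = enhance(clip, channel_gates(clip))
    fg = temporal_filter([frame[:fg_channels] for frame in enhanced])
    bg = spatial_filter([frame[fg_channels:] for frame in enhanced])
    return [f + b for f, b in zip(fg, bg)]

# Toy clip: 4 frames, 4 channels, 3x3 spatial grid, all ones.
clip = [[[[1.0] * 3 for _ in range(3)] for _ in range(4)] for _ in range(4)]
out = fex_block(clip, fg_channels=2)
print(len(out), len(out[0]), len(out[0][0]), len(out[0][0][0]))
```

The output has the same shape as the input, which is what allows a block of this kind to be inserted between layers of an existing 2D CNN. In a real network the fixed filters would be replaced by learned convolutions.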