Action Tube Extraction Based 3D-CNN for RGB-D Action Recognition

Bibliographic Details
Published in	2018 International Conference on Content Based Multimedia Indexing (CBMI) pp. 1 - 6
Main Authors	Xu, Zineng; Vilaplana, Veronica; Morros, Josep Ramon
Format	Conference Proceeding Publication
Language	English
Published	IEEE 01.09.2018 Institute of Electrical and Electronics Engineers (IEEE)
Subjects	3-D video (Three-dimensional imaging); 3D-CNN; action recognition; action tube extraction; CNN models; Computer architecture; Creació multimèdia; Data models; Digital techniques; Digital video; Electron tubes; Enginyeria de la telecomunicació; Image processing; Imatges; Indexes; indexing (of information); Processament; Processament de la imatge i del senyal vídeo; Processament del senyal; So, imatge i multimèdia; Solid modeling; spatial regions; state-of-the-art methods; structural similarity indices (SSIM); temporal sampling; Three-dimensional displays; Training; tubes (components); two-stream; Tècniques digitals; Visualització tridimensional (Informàtica); Vídeo digital; Àrees temàtiques de la UPC

Summary:	In this paper, we propose a novel action tube extractor for RGB-D action recognition in trimmed videos. The extractor takes a video as input and outputs an action tube. The method consists of two parts: spatial tube extraction and temporal sampling. The first part is built upon MobileNet-SSD and defines the spatial region where the action takes place. The second part is based on the structural similarity index (SSIM) and removes frames without obvious motion from the primary action tube. The final extracted action tube has two benefits: 1) a higher ratio of ROI (the subjects of the action) to background; 2) most frames contain obvious motion change. We use a two-stream (RGB and depth) I3D architecture as our 3D-CNN model. Our approach outperforms the state-of-the-art methods on the OA and NTU RGB-D datasets.
ISBN:	1538670216 9781538670217
DOI:	10.1109/CBMI.2018.8516450
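
Illustrative note: the SSIM-based temporal sampling mentioned in the summary can be sketched roughly as follows. This is not the authors' implementation; the record gives no details beyond the use of SSIM. The function names `global_ssim` and `sample_by_motion`, the threshold of 0.95, the comparison against the last kept frame, and the use of a single whole-frame SSIM window (instead of the usual sliding window) are all assumptions made for illustration.

```python
import numpy as np


def global_ssim(a, b, data_range=255.0):
    """SSIM computed over the whole frame as one window.

    Assumption: a simplification of the standard sliding-window SSIM,
    used here only to keep the sketch dependency-free.
    """
    a = a.astype(np.float64)
    b = b.astype(np.float64)
    # Standard SSIM stabilizing constants (K1 = 0.01, K2 = 0.03).
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    mu_a, mu_b = a.mean(), b.mean()
    var_a, var_b = a.var(), b.var()
    cov = ((a - mu_a) * (b - mu_b)).mean()
    numerator = (2 * mu_a * mu_b + c1) * (2 * cov + c2)
    denominator = (mu_a ** 2 + mu_b ** 2 + c1) * (var_a + var_b + c2)
    return numerator / denominator


def sample_by_motion(frames, threshold=0.95):
    """Return indices of frames that differ enough from the last kept frame.

    Assumption: a frame is treated as having "no obvious motion" when its
    SSIM with the previously kept frame is at or above `threshold`.
    """
    if len(frames) == 0:
        return []
    kept = [0]  # always keep the first frame
    for i in range(1, len(frames)):
        if global_ssim(frames[kept[-1]], frames[i]) < threshold:
            kept.append(i)
    return kept
```

Applied to a list of grayscale frames (2-D arrays) cropped to the spatial tube, `sample_by_motion` returns the indices of the frames to retain; the threshold would need tuning per dataset.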