Asymmetric 3D Convolutional Neural Networks for action recognition

Bibliographic Details
Published in: Pattern Recognition, Vol. 85, pp. 1-12
Main Authors: Yang, Hao; Yuan, Chunfeng; Li, Bing; Du, Yang; Xing, Junliang; Hu, Weiming; Maybank, Stephen J.
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.01.2019

Summary:
•We propose asymmetric one-directional 3D convolutions to approximate the traditional 3D convolution. The asymmetric 3D convolutions decrease parameters and computational cost significantly.
•To improve the feature learning capacity of asymmetric 3D convolutional layers, we propose local 3D convolutional networks, called MicroNets, which incorporate multi-scale 3D convolutional branches to handle convolutional features at different scales in videos.
•Based on the MicroNets, we design an asymmetric 3D convolutional deep model which outperforms traditional 3D-CNN models in both effectiveness and efficiency.
•We propose the multi-source enhanced input to decrease the computational cost further by avoiding training two deep networks individually.
Based on the above technical innovations, our model outperforms all the traditional 3D-CNN models in both effectiveness and efficiency, and is comparable with recent state-of-the-art action recognition methods on two of the most challenging benchmarks, the UCF-101 and HMDB-51 datasets.

Convolutional Neural Network based action recognition methods have achieved significant improvements in recent years. The 3D convolution extends the 2D convolution to the spatio-temporal domain for better analysis of human activities in videos. The 3D convolution, however, involves many more parameters than the 2D convolution. Thus, it is much more expensive to compute, costly to store, and difficult to learn. This work proposes efficient asymmetric one-directional 3D convolutions to approximate the traditional 3D convolution. To improve the feature learning capacity of asymmetric 3D convolutions, a set of local 3D convolutional networks, called MicroNets, is proposed by incorporating multi-scale 3D convolution branches. An asymmetric 3D-CNN deep model is then constructed from MicroNets for the action recognition task. Moreover, to avoid training two networks on the RGB and Flow frames separately, as most works do, a simple but effective multi-source enhanced input is proposed, which fuses useful information from the RGB and Flow frames at the pre-processing stage. The asymmetric 3D-CNN model is evaluated on two of the most challenging action recognition benchmarks, UCF-101 and HMDB-51. It outperforms all the traditional 3D-CNN models in both effectiveness and efficiency, and its performance is comparable with that of recent state-of-the-art action recognition methods on both benchmarks.
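The abstract gives no implementation details, so the sketch below is only one plausible reading of "asymmetric one-directional 3D convolutions": a dense 3×3×3 kernel approximated by three 1-D kernels (3×1×1, 1×3×1, 1×1×3) applied in sequence, written in PyTorch. The module name, channel widths, kernel ordering, and activations are illustrative assumptions, not the authors' architecture.

```python
# Minimal sketch (not the authors' code): a dense 3x3x3 convolution
# approximated by three one-directional convolutions, one per axis.
# Kernel ordering, channel widths, and activations are assumptions.
import torch
import torch.nn as nn

class AsymmetricConv3d(nn.Module):
    """Approximates a 3x3x3 convolution with three one-directional convolutions."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.branch = nn.Sequential(
            # temporal direction: kernel (3, 1, 1)
            nn.Conv3d(in_channels, out_channels, kernel_size=(3, 1, 1), padding=(1, 0, 0)),
            nn.ReLU(inplace=True),
            # vertical direction: kernel (1, 3, 1)
            nn.Conv3d(out_channels, out_channels, kernel_size=(1, 3, 1), padding=(0, 1, 0)),
            nn.ReLU(inplace=True),
            # horizontal direction: kernel (1, 1, 3)
            nn.Conv3d(out_channels, out_channels, kernel_size=(1, 1, 3), padding=(0, 0, 1)),
            nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.branch(x)

# Parameter count comparison against a dense 3x3x3 convolution (64 -> 64 channels).
dense = nn.Conv3d(64, 64, kernel_size=3, padding=1)
asym = AsymmetricConv3d(64, 64)
print(sum(p.numel() for p in dense.parameters()))  # ~110k parameters
print(sum(p.numel() for p in asym.parameters()))   # ~37k parameters

# Sanity check on a clip-shaped tensor: (batch, channels, frames, height, width).
x = torch.randn(1, 64, 16, 112, 112)
print(asym(x).shape)  # torch.Size([1, 64, 16, 112, 112])
```

Under this reading, each position uses three 1-D kernels (9 weights) in place of one 27-weight dense kernel, which is consistent with the significant reduction in parameters and computation the summary reports, though the paper's exact factorization, multi-scale MicroNet branches, and multi-source input fusion are not reproduced here.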
ISSN: 0031-3203, 1873-5142
DOI: 10.1016/j.patcog.2018.07.028