Beyond short snippets: Deep networks for video classification

Bibliographic Details
Published in	2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4694-4702
Main Authors	Ng, Joe Yue-Hei; Hausknecht, Matthew; Vijayanarasimhan, Sudheendra; Vinyals, Oriol; Monga, Rajat; Toderici, George
Format	Conference Proceeding; Journal Article
Language	English
Published	IEEE 01.06.2015
Subjects	Computer architecture; Computer vision; Image detection; Image recognition; Logic gates; Networks; Neural networks; Object recognition; Optical imaging; Pattern recognition; Retrieval; Segmentation; Time-domain analysis; Training
Summary:	Convolutional neural networks (CNNs) have been extensively applied to image recognition problems, giving state-of-the-art results on recognition, detection, segmentation and retrieval. In this work we propose and evaluate several deep neural network architectures to combine image information across a video over longer time periods than previously attempted. We propose two methods capable of handling full-length videos. The first method explores various convolutional temporal feature pooling architectures, examining the design choices that need to be made when adapting a CNN for this task. The second method explicitly models the video as an ordered sequence of frames. For this purpose we employ a recurrent neural network that uses Long Short-Term Memory (LSTM) cells connected to the output of the underlying CNN. Our best networks exhibit significant performance improvements over previously published results on the Sports 1 million dataset (73.1% vs. 60.9%) and on the UCF-101 dataset, both with (88.6% vs. 88.0%) and without (82.6% vs. 73.0%) additional optical flow information.
ISSN:	1063-6919, 2575-7075
DOI:	10.1109/CVPR.2015.7299101