Can Spatiotemporal 3D CNNs Retrace the History of 2D CNNs and ImageNet?

Bibliographic Details
Published in: 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 6546–6555
Main Authors: Hara, Kensho; Kataoka, Hirokatsu; Satoh, Yutaka
Format: Conference Proceeding
Language: English
Published: IEEE, 01.06.2018
Subjects: Computer vision; Kernel; Kinetic theory; Task analysis; Three-dimensional displays; Training; Two dimensional displays
Online Access: https://ieeexplore.ieee.org/document/8578783

Abstract: The purpose of this study is to determine whether current video datasets have sufficient data for training very deep convolutional neural networks (CNNs) with spatio-temporal three-dimensional (3D) kernels. Recently, the performance levels of 3D CNNs in the field of action recognition have improved significantly. However, to date, conventional research has only explored relatively shallow 3D architectures. We examine the architectures of various 3D CNNs, from relatively shallow to very deep, on current video datasets. Based on the results of those experiments, the following conclusions can be drawn: (i) ResNet-18 training resulted in significant overfitting on UCF-101, HMDB-51, and ActivityNet, but not on Kinetics. (ii) The Kinetics dataset has sufficient data for training deep 3D CNNs and enables the training of ResNets with up to 152 layers, interestingly similar to 2D ResNets on ImageNet; ResNeXt-101 achieved 78.4% average accuracy on the Kinetics test set. (iii) Simple 3D architectures pretrained on Kinetics outperform complex 2D architectures, and the pretrained ResNeXt-101 achieved 94.5% and 70.2% accuracy on UCF-101 and HMDB-51, respectively. The use of 2D CNNs trained on ImageNet has produced significant progress in various image tasks. We believe that using deep 3D CNNs together with Kinetics will retrace the successful history of 2D CNNs and ImageNet, and stimulate advances in computer vision for videos. The code and pretrained models used in this study are publicly available.
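The abstract's central ingredient, a spatio-temporal 3D kernel, is easy to make concrete. Below is a minimal sketch (assuming PyTorch; this is not the authors' released code, only an illustration of the idea) of a ResNet basic block in which the usual 3x3 spatial convolutions become 3x3x3 convolutions over (time, height, width). The class name BasicBlock3D and the toy clip shape are illustrative choices, not names from the paper.

```python
# Minimal sketch of a 3D ResNet basic block: two 3x3x3 spatio-temporal
# convolutions with an identity (or projected) shortcut, mirroring the
# 2D ResNet design but convolving jointly over time, height, and width.
import torch
import torch.nn as nn

class BasicBlock3D(nn.Module):
    def __init__(self, in_planes: int, planes: int, stride: int = 1):
        super().__init__()
        self.conv1 = nn.Conv3d(in_planes, planes, kernel_size=3,
                               stride=stride, padding=1, bias=False)
        self.bn1 = nn.BatchNorm3d(planes)
        self.conv2 = nn.Conv3d(planes, planes, kernel_size=3,
                               stride=1, padding=1, bias=False)
        self.bn2 = nn.BatchNorm3d(planes)
        self.relu = nn.ReLU(inplace=True)
        # Project the shortcut when the shape changes, as in 2D ResNets.
        self.downsample = None
        if stride != 1 or in_planes != planes:
            self.downsample = nn.Sequential(
                nn.Conv3d(in_planes, planes, kernel_size=1,
                          stride=stride, bias=False),
                nn.BatchNorm3d(planes),
            )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        identity = x if self.downsample is None else self.downsample(x)
        out = self.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return self.relu(out + identity)

# Toy usage: a batch of 2 clips, 16 RGB frames each, at 112x112 pixels.
clip = torch.randn(2, 3, 16, 112, 112)  # (N, C, T, H, W)
block = BasicBlock3D(in_planes=3, planes=64, stride=2)
print(block(clip).shape)  # torch.Size([2, 64, 8, 56, 56])
```

Fine-tuning in the style of conclusion (iii) would then amount to loading Kinetics-pretrained weights into a network of such blocks and replacing only the final fully connected layer with a 101-way (UCF-101) or 51-way (HMDB-51) classifier.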
CODEN: IEEPAD
DOI: 10.1109/CVPR.2018.00685
Discipline: Applied Sciences
EISBN: 9781538664209, 1538664208
EISSN: 1063-6919
Page Count: 10
Publication Title Abbreviation: CVPR