End-to-End Learning of Action Detection from Frame Glimpses in Videos

In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.

Bibliographic Details
Published in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2678-2687
Main Authors Yeung, Serena; Russakovsky, Olga; Mori, Greg; Li, Fei-Fei
Format Conference Proceeding
Language English
Published IEEE 01.06.2016
Subjects
ISSN 1063-6919
DOI 10.1109/CVPR.2016.293

Abstract In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.
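The abstract's central training idea is that the agent's choices (where to look next, when to emit a prediction) are sampled, hence non-differentiable, so the policy is trained with REINFORCE. The following toy sketch illustrates that estimator only; it is not the paper's model. The one-hot frame features, the two-action policy ("keep observing" vs. "emit"), the reward, and all dimensions are simplified stand-ins:

```python
# Toy REINFORCE sketch (illustrative, not the authors' code): a linear softmax
# policy decides per "frame" whether to keep observing or emit a prediction.
# Sampling the action is non-differentiable, so we use the score-function
# (policy-gradient) estimator instead of plain backpropagation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

W = np.zeros((2, 4))         # policy weights: 2 actions x 4-dim frame feature
lr = 0.1                     # learning rate

def run_episode(W):
    """Step through 4 'frames'; reward 1 only if we emit on the target frame."""
    grads, reward = [], 0.0
    target = 2               # pretend the action occurs at frame 2
    for t in range(4):
        x = np.zeros(4); x[t] = 1.0            # one-hot stand-in for a frame feature
        p = softmax(W @ x)
        a = rng.choice(2, p=p)                 # 0 = keep observing, 1 = emit
        onehot = np.zeros(2); onehot[a] = 1.0
        grads.append(np.outer(onehot - p, x))  # grad of log pi(a|x) for softmax
        if a == 1:                             # episode ends when we emit
            reward = 1.0 if t == target else 0.0
            break
    return grads, reward

for _ in range(2000):
    grads, r = run_episode(W)
    for g in grads:
        W += lr * r * g      # REINFORCE: scale each log-prob gradient by the return
```

After training, the policy assigns high probability to "keep observing" on early frames and to "emit" on the rewarded frame, mirroring (in miniature) how the paper's agent learns when to commit to a temporal-bound prediction.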
Author Russakovsky, Olga
Mori, Greg
Li, Fei-Fei
Yeung, Serena
Author_xml – sequence: 1
  givenname: Serena
  surname: Yeung
  fullname: Yeung, Serena
  email: serena@cs.stanford.edu
  organization: Stanford Univ., Stanford, CA, USA
– sequence: 2
  givenname: Olga
  surname: Russakovsky
  fullname: Russakovsky, Olga
  email: olgarus@cmu.edu
  organization: Stanford Univ., Stanford, CA, USA
– sequence: 3
  givenname: Greg
  surname: Mori
  fullname: Mori, Greg
  email: mori@cs.sfu.ca
  organization: Simon Fraser Univ., Burnaby, BC, Canada
– sequence: 4
  givenname: Fei-Fei
  surname: Li
  fullname: Li, Fei-Fei
  email: feifeili@cs.stanford.edu
  organization: Stanford Univ., Stanford, CA, USA
CODEN IEEPAD
ContentType Conference Proceeding
Discipline Applied Sciences
Computer Science
EISBN 9781467388511
1467388513
EISSN 1063-6919
EndPage 2687
ExternalDocumentID 7780662
Genre orig-research
IsPeerReviewed false
IsScholarly true
Language English
PageCount 10
ParticipantIDs ieee_primary_7780662
PublicationCentury 2000
PublicationDate 2016-June
PublicationDateYYYYMMDD 2016-06-01
PublicationDate_xml – month: 06
  year: 2016
  text: 2016-June
PublicationDecade 2010
PublicationTitle 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
PublicationTitleAbbrev CVPR
PublicationYear 2016
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 2678
SubjectTerms Backpropagation
Computational modeling
Computer vision
Feature extraction
Sports equipment
Training
Videos
Title End-to-End Learning of Action Detection from Frame Glimpses in Videos
URI https://ieeexplore.ieee.org/document/7780662