End-to-End Learning of Action Detection from Frame Glimpses in Videos

In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.

Bibliographic Details
Published in 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2678-2687
Main Authors Yeung, Serena; Russakovsky, Olga; Mori, Greg; Li, Fei-Fei
Format Conference Proceeding
Language English
Published IEEE 01.06.2016
Subjects
ISSN 1063-6919
DOI 10.1109/CVPR.2016.293

Abstract In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.
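The abstract's central training idea is that the agent's choices (where to look next, when to emit a prediction) are sampled, hence non-differentiable, so the policy is trained with REINFORCE. The following toy sketch illustrates that estimator only; it is not the paper's model. The one-hot frame features, the two-action policy ("keep observing" vs. "emit"), the reward, and all dimensions are simplified stand-ins:

```python
# Toy REINFORCE sketch (illustrative, not the authors' code): a linear softmax
# policy decides per "frame" whether to keep observing or emit a prediction.
# Sampling the action is non-differentiable, so we use the score-function
# (policy-gradient) estimator instead of plain backpropagation.
import numpy as np

rng = np.random.default_rng(0)

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

W = np.zeros((2, 4))         # policy weights: 2 actions x 4-dim frame feature
lr = 0.1                     # learning rate

def run_episode(W):
    """Step through 4 'frames'; reward 1 only if we emit on the target frame."""
    grads, reward = [], 0.0
    target = 2               # pretend the action occurs at frame 2
    for t in range(4):
        x = np.zeros(4); x[t] = 1.0            # one-hot stand-in for a frame feature
        p = softmax(W @ x)
        a = rng.choice(2, p=p)                 # 0 = keep observing, 1 = emit
        onehot = np.zeros(2); onehot[a] = 1.0
        grads.append(np.outer(onehot - p, x))  # grad of log pi(a|x) for softmax
        if a == 1:                             # episode ends when we emit
            reward = 1.0 if t == target else 0.0
            break
    return grads, reward

for _ in range(2000):
    grads, r = run_episode(W)
    for g in grads:
        W += lr * r * g      # REINFORCE: scale each log-prob gradient by the return
```

After training, the policy assigns high probability to "keep observing" on early frames and to "emit" on the rewarded frame, mirroring (in miniature) how the paper's agent learns when to commit to a temporal-bound prediction.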
Author Russakovsky, Olga
Mori, Greg
Li, Fei-Fei
Yeung, Serena
Author_xml – sequence: 1
  givenname: Serena
  surname: Yeung
  fullname: Yeung, Serena
  email: serena@cs.stanford.edu
  organization: Stanford Univ., Stanford, CA, USA
– sequence: 2
  givenname: Olga
  surname: Russakovsky
  fullname: Russakovsky, Olga
  email: olgarus@cmu.edu
  organization: Stanford Univ., Stanford, CA, USA
– sequence: 3
  givenname: Greg
  surname: Mori
  fullname: Mori, Greg
  email: mori@cs.sfu.ca
  organization: Simon Fraser Univ., Burnaby, BC, Canada
– sequence: 4
  givenname: Fei-Fei
  surname: Li
  fullname: Li, Fei-Fei
  email: feifeili@cs.stanford.edu
  organization: Stanford Univ., Stanford, CA, USA
CODEN IEEPAD
ContentType Conference Proceeding
Discipline Applied Sciences
Computer Science
EISBN 9781467388511
1467388513
EISSN 1063-6919
EndPage 2687
ExternalDocumentID 7780662
Genre orig-research
IsPeerReviewed false
IsScholarly true
Language English
PageCount 10
ParticipantIDs ieee_primary_7780662
PublicationCentury 2000
PublicationDate 2016-June
PublicationDateYYYYMMDD 2016-06-01
PublicationDate_xml – month: 06
  year: 2016
  text: 2016-June
PublicationDecade 2010
PublicationTitle 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR)
PublicationTitleAbbrev CVPR
PublicationYear 2016
Publisher IEEE
Publisher_xml – name: IEEE
SourceID ieee
SourceType Publisher
StartPage 2678
SubjectTerms Backpropagation
Computational modeling
Computer vision
Feature extraction
Sports equipment
Training
Videos
Title End-to-End Learning of Action Detection from Frame Glimpses in Videos
URI https://ieeexplore.ieee.org/document/7780662