End-to-End Learning of Action Detection from Frame Glimpses in Videos
In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames.
Published in | 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 2678-2687 |
---|---|
Main Authors | Serena Yeung, Olga Russakovsky, Greg Mori, Li Fei-Fei |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 01.06.2016 |
Subjects | Backpropagation; Computational modeling; Computer vision; Feature extraction; Sports equipment; Training; Videos |
ISSN | 1063-6919 |
DOI | 10.1109/CVPR.2016.293 |
Abstract | In this work we introduce a fully end-to-end approach for action detection in videos that learns to directly predict the temporal bounds of actions. Our intuition is that the process of detecting actions is naturally one of observation and refinement: observing moments in video, and refining hypotheses about when an action is occurring. Based on this insight, we formulate our model as a recurrent neural network-based agent that interacts with a video over time. The agent observes video frames and decides both where to look next and when to emit a prediction. Since backpropagation is not adequate in this non-differentiable setting, we use REINFORCE to learn the agent's decision policy. Our model achieves state-of-the-art results on the THUMOS'14 and ActivityNet datasets while observing only a fraction (2% or less) of the video frames. |
---|---|
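The mechanism the abstract describes, sampling a discrete non-differentiable decision and training it with the REINFORCE score-function estimator, can be illustrated with a deliberately tiny sketch. This is not the authors' model (which uses a recurrent network, learned glimpse locations, and real video features): the 1-D "video", the logistic emit policy, and the reward below are all hypothetical simplifications made up for illustration.

```python
import math
import random

random.seed(0)

T = 20                       # number of frames in the toy "video"
ACTION_FRAME = 12            # ground-truth temporal location of the action
features = [0.0] * T
features[ACTION_FRAME] = 1.0  # 1-D feature: "action-ness" of each frame

# Logistic policy: P(emit at frame t) = sigmoid(w * feature + b).
w, b = 0.0, -2.0

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

for episode in range(2000):
    grads = []
    emit_frame = None
    for t in range(T):
        p = sigmoid(w * features[t] + b)
        a = random.random() < p          # sample the non-differentiable decision
        # Gradient of log Bernoulli(a; p) w.r.t. the logit is (a - p).
        g = (1.0 if a else 0.0) - p
        grads.append((g * features[t], g))
        if a:
            emit_frame = t
            break
    # Reward: +1 for emitting exactly at the action frame, small penalty otherwise.
    R = 1.0 if emit_frame == ACTION_FRAME else -0.1
    lr = 0.1
    for gw, gb in grads:                 # REINFORCE update: theta += lr * R * grad log pi
        w += lr * R * gw
        b += lr * R * gb

# After training, the policy should emit with higher probability at the
# action frame than at background frames.
p_action = sigmoid(w + b)
p_background = sigmoid(b)
```

The key point this sketch shows is the one the abstract makes: because the emit decision is a sample, not a differentiable function of the parameters, ordinary backpropagation cannot pass through it, so the gradient of the expected reward is estimated as `R * grad log pi(a|s)` over sampled trajectories.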
Author | Serena Yeung (Stanford University); Olga Russakovsky (Stanford University); Greg Mori (Simon Fraser University); Li Fei-Fei (Stanford University) |
CODEN | IEEPAD |
Discipline | Applied Sciences; Computer Science |
EISBN | 9781467388511; 1467388513 |
ExternalDocumentID | 7780662 |
Genre | orig-research |
PageCount | 10 |
PublicationTitleAbbrev | CVPR |
SubjectTerms | Backpropagation; Computational modeling; Computer vision; Feature extraction; Sports equipment; Training; Videos |
URI | https://ieeexplore.ieee.org/document/7780662 |