Together Recognizing, Localizing and Summarizing Actions in Egocentric Videos

Bibliographic Details
Published in: IEEE Transactions on Image Processing, Vol. 30, pp. 4330-4340
Main Authors: Sahu, Abhimanyu; Chowdhury, Ananda S.
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021

Summary: Analysis of egocentric video has recently drawn the attention of researchers in both the computer vision and multimedia communities. In this paper, we propose a weakly supervised, superpixel-level joint framework for localization, recognition and summarization of actions in an egocentric video. We first recognize and localize single as well as multiple actions in each frame of an egocentric video and then construct a summary of these detected actions. The superpixel-level solution enables precise localization of actions in addition to improving recognition accuracy. Superpixels are extracted within the central regions of the egocentric video frames; these central regions are determined using a previously developed center-surround model. A sparse spatio-temporal video representation graph is constructed in the deep feature space with the superpixels as nodes. A weakly supervised solution using random walks yields action labels for each superpixel. After determining the action label(s) for each frame from its constituent superpixels, we apply a fractional knapsack type formulation to obtain a summary of actions. Experimental comparisons on the publicly available ADL, GTEA, EGTEA Gaze+, EgoGesture, and EPIC-Kitchens datasets show the effectiveness of the proposed solution.
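The summarization step is described as a fractional knapsack type formulation. Below is a minimal, hypothetical sketch of that idea in Python, assuming each detected action segment carries an importance score and a duration, and the summary must fit a time budget; the class and function names are illustrative and not the authors' actual implementation.

```python
# Hypothetical sketch: greedy fractional-knapsack selection of detected
# action segments for an egocentric video summary. All names (ActionSegment,
# summarize, budget) are assumptions for illustration only.
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ActionSegment:
    label: str        # recognized action label for the segment
    score: float      # importance/confidence score of the segment
    duration: float   # segment length in seconds


def summarize(segments: List[ActionSegment],
              budget: float) -> List[Tuple[ActionSegment, float]]:
    """Pick segments in decreasing score-to-duration ratio until the time
    budget is used up; the last segment may be kept only fractionally,
    as in the classical fractional knapsack problem."""
    ranked = sorted(segments, key=lambda s: s.score / s.duration, reverse=True)
    summary, remaining = [], budget
    for seg in ranked:
        if remaining <= 0:
            break
        take = min(seg.duration, remaining)          # seconds kept from this segment
        summary.append((seg, take / seg.duration))   # (segment, kept fraction)
        remaining -= take
    return summary


if __name__ == "__main__":
    segs = [
        ActionSegment("open fridge", score=0.9, duration=4.0),
        ActionSegment("pour water", score=0.6, duration=6.0),
        ActionSegment("wash hands", score=0.8, duration=10.0),
    ]
    for seg, frac in summarize(segs, budget=12.0):
        print(f"{seg.label}: keep {frac:.0%} of {seg.duration:.0f}s")
```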
ISSN: 1057-7149
EISSN: 1941-0042
DOI: 10.1109/TIP.2021.3070732