Generating Personalized Summaries of Day Long Egocentric Videos

Bibliographic Details
Published in IEEE Transactions on Pattern Analysis and Machine Intelligence Vol. 45; no. 6; pp. 6832-6845
Main Authors Nagar, Pravin; Rathore, Anuj; Jawahar, C. V.; Arora, Chetan
Format Journal Article
Language English
Published United States IEEE 01.06.2023
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: The popularity of egocentric cameras and their always-on nature has led to an abundance of day-long first-person videos. The highly redundant nature of these videos and their extreme camera shake make them difficult to watch from beginning to end. These videos require efficient summarization tools for consumption. However, traditional summarization techniques developed for static surveillance videos or for highly curated sports videos and movies are either not suitable or simply do not scale to such hours-long videos in the wild. On the other hand, specialized summarization techniques developed for egocentric videos limit their focus to important objects and people. This paper presents a novel unsupervised reinforcement learning framework to summarize egocentric videos in terms of both length and content. The proposed framework facilitates incorporating various prior preferences, such as faces, places, or scene diversity, as well as interactive user choice in terms of including or excluding particular types of content. This approach can also be adapted to generate summaries of various lengths, making it possible to view even 1-minute summaries of one's entire day. When using the facial saliency-based reward, we show that our approach generates summaries focusing on social interactions, similar to the current state-of-the-art (SOTA). The quantitative comparisons on the benchmark Disney dataset show that our method achieves significant improvement in Relaxed F-Score (RFS) (29.60 compared to 19.21 from SOTA), BLEU score (0.68 compared to 0.67 from SOTA), Average Human Ranking (AHR), and unique events covered. Finally, we show that our technique can be applied to summarize traditional, short, hand-held videos as well, where we improve the SOTA F-score on the benchmark SumMe and TVSum datasets from 41.4 to 46.40 and from 57.6 to 58.3, respectively. We also provide a PyTorch implementation and a web demo at https://pravin74.github.io/Int-sum/index.html.
ISSN: 0162-8828; 1939-3539; 2160-9292
DOI: 10.1109/TPAMI.2021.3118077