Enhancing Video Summarization via Vision-Language Embedding


Bibliographic Details
Published in 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1052-1060
Main Authors Plummer, Bryan A.; Brown, Matthew; Lazebnik, Svetlana
Format Conference Proceeding
Language English
Published IEEE 01.07.2017
Summary: This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language make a good fit for this application by extending a recent submodular summarization approach [9] with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We perform an evaluation on two diverse datasets, UT Egocentric [18] and TV Episodes [45], and show that our new objectives give improved summarization ability compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.
ISSN: 1063-6919
DOI: 10.1109/CVPR.2017.118
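
The summary above mentions submodular selection with representativeness and interestingness objectives over embedded features, optionally guided by a text query. The following is a minimal illustrative sketch of that general idea, not the authors' implementation. It assumes each video segment and the text query are already embedded as plain vectors in a shared space; the function names, the facility-location form of representativeness, the cosine-based interestingness score, and the weights `w_rep` and `w_int` are all hypothetical choices made for this example.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def greedy_summary(segments, budget, query=None, w_rep=1.0, w_int=1.0):
    """Greedily pick up to `budget` segment indices (hypothetical helper).

    Objective (an assumption for illustration, not the paper's exact form):
      w_rep * facility-location coverage of all segments by the selection
      + w_int * sum of each selected segment's similarity to the text query.
    """
    n = len(segments)
    # Pairwise similarities, clipped at 0 so the coverage term is monotone.
    sims = [[max(0.0, cosine(segments[i], segments[j])) for j in range(n)]
            for i in range(n)]
    # Per-segment interestingness: similarity to the query, or 0 if no query.
    interest = [max(0.0, cosine(s, query)) if query is not None else 0.0
                for s in segments]

    def score(selected):
        if not selected:
            return 0.0
        coverage = sum(max(row[j] for j in selected) for row in sims)
        return w_rep * coverage + w_int * sum(interest[j] for j in selected)

    selected = []
    while len(selected) < min(budget, n):
        current = score(selected)
        best_index, best_gain = None, float("-inf")
        for j in range(n):
            if j in selected:
                continue
            gain = score(selected + [j]) - current
            if gain > best_gain:  # strict '>' keeps the lowest index on ties
                best_index, best_gain = j, gain
        selected.append(best_index)
    return sorted(selected)  # return indices in temporal order


# Toy usage: three 2-D "segment embeddings" (made-up numbers).
segments = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
unguided = greedy_summary(segments, budget=2)
guided = greedy_summary(segments, budget=2, query=[1.0, 0.0], w_int=10.0)
```

In this toy run the unguided selection favors segments that together cover all three, while a strongly weighted query pulls the selection toward the segments closest to the query vector. Greedy selection is a common choice for monotone submodular objectives under a size budget; the actual objectives, features, and weighting used in the paper should be taken from the full text.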