Enhancing Video Summarization via Vision-Language Embedding

Bibliographic Details
Published in	2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 1052-1060
Main Authors	Plummer, Bryan A.; Brown, Matthew; Lazebnik, Svetlana
Format	Conference Proceeding
Language	English
Published	IEEE 01.07.2017
Subjects	Cameras; Google; Image segmentation; Optimization; Semantics; Visualization
Summary:	This paper addresses video summarization, or the problem of distilling a raw video into a shorter form while still capturing the original story. We show that visual representations supervised by freeform language make a good fit for this application by extending a recent submodular summarization approach [9] with representativeness and interestingness objectives computed on features from a joint vision-language embedding space. We perform an evaluation on two diverse datasets, UT Egocentric [18] and TV Episodes [45], and show that our new objectives give improved summarization ability compared to standard visual features alone. Our experiments also show that the vision-language embedding need not be trained on domain-specific data, but can be learned from standard still image vision-language datasets and transferred to video. A further benefit of our model is the ability to guide a summary using freeform text input at test time, allowing user customization.
ISSN:	1063-6919
DOI:	10.1109/CVPR.2017.118