VSS-Net: Visual Semantic Self-Mining Network for Video Summarization

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, No. 4, pp. 2775-2788
Main Authors: Zhang, Yunzuo; Liu, Yameng; Kang, Weili; Tao, Ran
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.04.2024
Summary: Video summarization, which aims to detect valuable segments in untrimmed videos, is a meaningful yet understudied topic. Previous methods primarily consider inter-frame and inter-shot temporal dependencies, which may be insufficient to pinpoint important content because of the limited valuable information that can be learned. To address this limitation, we propose the Visual Semantic Self-mining Network (VSS-Net), a novel summarization framework motivated by the widespread success of cross-modality learning. VSS-Net first adopts a two-stream structure consisting of a Context Representation Graph (CRG) and a Video Semantics Encoder (VSE), which are jointly exploited to lay the groundwork for stronger content awareness. Specifically, the CRG is constructed with an edge-set strategy tailored to the hierarchical structure of videos, enriching visual features with local and non-local temporal cues from both temporal-order and visual-relationship perspectives. Meanwhile, by learning visual similarity across features, the VSE adaptively acquires an instructive video-level semantic representation of the input video in a coarse-to-fine manner. The two streams then converge in a Context-Semantics Interaction Layer (CSIL), which enables sophisticated information exchange between frame-level temporal cues and the video-level semantic representation, yielding informative representations and increasing sensitivity to important segments. Finally, importance scores are predicted with a prediction head, followed by key shot selection. We evaluate the proposed framework against state-of-the-art methods on widely used benchmarks and demonstrate its effectiveness and superiority.
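As a rough illustration of the pipeline the summary describes (a frame-level graph stream, a video-level semantic stream, an interaction layer, and a prediction head), the sketch below wires these stages together in PyTorch. The module names, feature dimensions, the single graph-convolution step, the attention pooling, and the concatenation-based fusion are simplifying assumptions made for illustration only; they are not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ContextGraphStream(nn.Module):
    """Frame-level stream: propagates local/non-local temporal cues over a frame graph.
    (Illustrative stand-in for the CRG; a single residual graph-convolution step is assumed.)"""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, feats, adj):
        # feats: (T, D) frame features; adj: (T, T) row-normalized adjacency over frames
        return torch.relu(adj @ self.proj(feats)) + feats

class VideoSemanticsStream(nn.Module):
    """Video-level stream: pools frames into one semantic vector via attention weighting.
    (Illustrative stand-in for the VSE.)"""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, feats):
        w = torch.softmax(self.score(feats), dim=0)   # (T, 1) attention over frames
        return (w * feats).sum(dim=0, keepdim=True)   # (1, D) video-level representation

class InteractionHead(nn.Module):
    """Fuses frame-level context with the video-level vector and predicts importance scores.
    (Illustrative stand-in for the CSIL plus prediction head.)"""
    def __init__(self, dim):
        super().__init__()
        self.fuse = nn.Linear(2 * dim, dim)
        self.head = nn.Linear(dim, 1)

    def forward(self, frame_feats, video_vec):
        joined = torch.cat([frame_feats, video_vec.expand_as(frame_feats)], dim=-1)
        fused = torch.relu(self.fuse(joined))
        return torch.sigmoid(self.head(fused)).squeeze(-1)  # (T,) per-frame scores in [0, 1]

# Toy usage: T frames of D-dim features and a simple temporal-neighbour adjacency.
T, D = 120, 256
feats = torch.randn(T, D)
adj = torch.eye(T) + torch.diag(torch.ones(T - 1), 1) + torch.diag(torch.ones(T - 1), -1)
adj = adj / adj.sum(dim=1, keepdim=True)              # row-normalize

crg, vse, csil = ContextGraphStream(D), VideoSemanticsStream(D), InteractionHead(D)
scores = csil(crg(feats, adj), vse(feats))            # per-frame importance scores
print(scores.shape)                                   # torch.Size([120])
```

In video summarizers of this kind, the predicted frame scores are typically aggregated to shot level and key shots are then selected under a length budget (commonly via 0/1 knapsack); that selection step is omitted from this sketch.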
ISSN: 1051-8215; 1558-2205
DOI: 10.1109/TCSVT.2023.3312325