Coherent Deep-Net Fusion To Classify Shots In Concert Videos

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 20, No. 11, pp. 3123-3136
Main Authors: Lin, Jen-Chun; Wei, Wen-Li; Liu, Tyng-Luh; Yang, Yi-Hsuan; Wang, Hsin-Min; Tyan, Hsiao-Rong; Liao, Hong-Yuan Mark
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.11.2018
Summary: Varying the types of shots is a fundamental element in the language of film, commonly used by directors in visual storytelling. The technique is often used in professional recordings of a live concert, but may not be applied appropriately in audience recordings of the same event. Such variations can make the task of classifying shots in concert videos, professional or amateur, very challenging. To achieve more reliable shot classification, we propose a novel probabilistic approach, named coherent classification net (CC-Net), that addresses three crucial issues. First, we focus on learning more effective features by fusing the layer-wise outputs extracted from a deep convolutional neural network (CNN) pretrained on a large-scale data set for object recognition. Second, we introduce a frame-wise classification scheme, the error weighted deep cross-correlation model (EW-Deep-CCM), to boost the classification accuracy. Specifically, the deep neural network-based cross-correlation model (Deep-CCM) is constructed not only to model the extracted feature hierarchies of the CNN independently, but also to relate the statistical dependencies of paired features from different layers. Then, a Bayesian error weighting scheme for classifier combination is adopted to explore the contributions from individual Deep-CCM classifiers and enhance the accuracy of shot classification in each image frame. Third, we feed the frame-wise classification results to a linear-chain conditional random field module to refine the shot predictions by taking into account the global and temporal regularities. We provide extensive experimental results on a data set of live concert videos to demonstrate the advantage of the proposed CC-Net over existing popular fusion approaches for shot classification.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2018.2820904
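
Illustrative sketch: The summary describes combining several frame-wise classifiers by error weighting and then refining the per-frame labels with a linear-chain model. The Python sketch below illustrates that general two-stage idea only and is not the authors' implementation. Everything in it is an assumption made for illustration: the function names `error_weighted_fusion` and `viterbi_smooth`, the simple weighting rule (one minus a validation error rate, normalized), and the use of Viterbi decoding with a hand-set transition matrix as a stand-in for a trained conditional random field.

```python
import numpy as np


def error_weighted_fusion(posteriors, error_rates):
    """Fuse per-classifier class posteriors for every frame.

    posteriors:  array of shape (K, T, C) for K classifiers, T frames, C classes.
    error_rates: array of shape (K,), a hypothetical validation error per classifier.
    Assumption: weights are (1 - error), normalized to sum to one. The paper's
    Bayesian error weighting scheme may differ.
    """
    posteriors = np.asarray(posteriors, dtype=float)
    weights = 1.0 - np.asarray(error_rates, dtype=float)
    weights = weights / weights.sum()
    # Weighted sum over the classifier axis gives a (T, C) array.
    fused = np.tensordot(weights, posteriors, axes=1)
    # Renormalize each frame so its class scores sum to one.
    return fused / fused.sum(axis=1, keepdims=True)


def viterbi_smooth(frame_probs, transition):
    """Return the most likely label sequence over a linear chain.

    frame_probs: array of shape (T, C) with per-frame class probabilities.
    transition:  array of shape (C, C); transition[i, j] scores moving from
                 label i to label j. Hand-set here, not learned as in a CRF.
    """
    log_emit = np.log(np.asarray(frame_probs, dtype=float) + 1e-12)
    log_trans = np.log(np.asarray(transition, dtype=float) + 1e-12)
    num_frames, num_classes = log_emit.shape

    score = log_emit[0].copy()
    backptr = np.zeros((num_frames, num_classes), dtype=int)
    for t in range(1, num_frames):
        # cand[i, j]: best score ending in label i at t-1, then moving to j.
        cand = score[:, None] + log_trans
        backptr[t] = cand.argmax(axis=0)
        score = cand.max(axis=0) + log_emit[t]

    # Trace the best path backwards from the final frame.
    path = np.zeros(num_frames, dtype=int)
    path[-1] = score.argmax()
    for t in range(num_frames - 1, 0, -1):
        path[t - 1] = backptr[t, path[t]]
    return path
```

With a transition matrix that favors staying in the same label, an isolated frame whose fused scores disagree with its neighbors tends to be relabeled to match them, which is the kind of temporal regularity the summary refers to.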