Zero-Shot Event Detection Using Multi-modal Fusion of Weakly Supervised Concepts

Bibliographic Details
Published in: 2014 IEEE Conference on Computer Vision and Pattern Recognition, pp. 2665 - 2672
Main Authors: Shuang Wu, Sravanthi Bondugula, Florian Luisier, Xiaodan Zhuang, Pradeep Natarajan
Format: Conference Proceeding; Journal Article
Language: English
Published: IEEE, 01.06.2014
Summary: Current state-of-the-art systems for visual content analysis require large training sets for each class of interest, and performance degrades rapidly with fewer examples. In this paper, we present a general framework for the zero-shot learning problem of performing high-level event detection with no training exemplars, using only textual descriptions. This task goes beyond the traditional zero-shot framework of adapting a given set of classes with training data to unseen classes. We leverage video and image collections with free-form text descriptions from widely available web sources to learn a large bank of concepts, in addition to using several off-the-shelf concept detectors, speech, and video text for representing videos. We utilize natural language processing technologies to generate event description features. The extracted features are then projected to a common high-dimensional space using text expansion, and similarity is computed in this space. We present extensive experimental results on the large TRECVID MED [26] corpus to demonstrate our approach. Our results show that the proposed concept detection methods significantly outperform current attribute classifiers such as Classemes [34], ObjectBank [21], and SUN attributes [28]. Further, we find that fusion, both within and between modalities, is crucial for optimal performance.
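
The scoring pipeline the summary describes (per-modality concept detectors, an event query built from its textual description, similarity computed in a shared concept space, and fusion across modalities) can be sketched compactly. The Python below is a minimal illustrative sketch, not the authors' implementation: the concept vocabulary, the simple term-matching text projection, and the average-fusion rule are all stand-in assumptions for the paper's learned concept banks, NLP-based text expansion, and tuned fusion.

import numpy as np

# Hypothetical concept vocabulary; the paper learns a large bank of
# concepts from web video/image collections with text descriptions.
CONCEPT_VOCAB = ["dog", "ball", "grass", "crowd", "music"]

def text_to_concept_vector(description, vocab=CONCEPT_VOCAB):
    """Project an event description onto the concept vocabulary by
    naive term counting; the paper uses richer NLP and text expansion."""
    tokens = description.lower().split()
    return np.array([tokens.count(c) for c in vocab], dtype=float)

def cosine(a, b):
    """Cosine similarity in the shared concept space."""
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom else 0.0

def zero_shot_score(event_description, modality_concept_scores):
    """Score a video against an event with no training exemplars.
    modality_concept_scores maps a modality name (e.g. 'visual',
    'speech', 'videotext') to its concept-score vector over
    CONCEPT_VOCAB; scores are fused by a simple average (late fusion)."""
    query = text_to_concept_vector(event_description)
    per_modality = [cosine(query, v) for v in modality_concept_scores.values()]
    return sum(per_modality) / len(per_modality)

# Illustrative usage with made-up detector outputs for one video:
video = {
    "visual": np.array([0.9, 0.8, 0.7, 0.1, 0.0]),
    "speech": np.array([0.5, 0.2, 0.0, 0.0, 0.1]),
}
print(zero_shot_score("a dog playing with a ball on grass", video))
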
ISSN: 1063-6919
EISSN: 2575-7075
DOI: 10.1109/CVPR.2014.341