Bag of visual words and fusion methods for action recognition: Comprehensive study and good practice


Bibliographic Details
Published in	Computer Vision and Image Understanding, Vol. 150, pp. 109-125
Main Authors	Peng, Xiaojiang; Wang, Limin; Wang, Xingxing; Qiao, Yu
Format	Journal Article
Language	English
Published	Elsevier Inc 01.09.2016
Subjects	Action recognition; Bag of visual words; Computer vision; Dynamical systems; Dynamics; Encoding; Feature encoding; Fusion methods; Moving object recognition; Representations; State of the art; Visual

Summary:	• A comprehensive study of the BoVW pipeline for the action recognition task.
• An evaluation and a generic analysis of 13 encoding methods.
• Intra-normalization for supervector-based encoding methods.
• An evaluation of three fusion methods.
• Several good practices for the BoVW pipeline in the action recognition task.

Video-based action recognition is one of the important and challenging problems in computer vision research. The bag of visual words (BoVW) model with local features has long been popular and has obtained state-of-the-art performance on several realistic datasets, such as HMDB51, UCF50, and UCF101. BoVW is a general pipeline for constructing a global representation from local features, and it is mainly composed of five steps: (i) feature extraction, (ii) feature pre-processing, (iii) codebook generation, (iv) feature encoding, and (v) pooling and normalization. Although many efforts have been made on each step independently in different scenarios, their effects on action recognition are still unknown. Meanwhile, video data exhibits different views of visual patterns, such as static appearance and motion dynamics. Multiple descriptors are usually extracted to represent these different views, and fusing these descriptors is crucial for boosting the final performance of an action recognition system. This paper aims to provide a comprehensive study of all steps in BoVW and of different fusion methods, and to uncover some good practices for producing a state-of-the-art action recognition system. Specifically, we explore two kinds of local features, ten kinds of encoding methods, eight kinds of pooling and normalization strategies, and three kinds of fusion methods. We conclude that every step is crucial to the final recognition rate and that an improper choice in one step may counteract the performance improvement of the other steps. Furthermore, based on our comprehensive study, we propose a simple yet effective representation, called hybrid supervector, by exploring the complementarity of different BoVW frameworks with improved dense trajectories. Using this representation, we obtain impressive results on three challenging datasets: HMDB51 (61.9%), UCF50 (92.3%), and UCF101 (87.9%).
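
The five pipeline steps named in the summary can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The random descriptors, the function names, the codebook size, and the choice of VLAD as the supervector encoding are all assumptions made for illustration. Intra-normalization is taken here in its usual sense of L2-normalizing each codeword's block of the supervector separately before a final global L2 normalization, which is assumed to match the paper's usage.

```python
import numpy as np

def pca_whiten(X, dim):
    """Step (ii): center, project onto the top `dim` principal axes, whiten.
    (In practice the projection is learned on training descriptors.)"""
    Xc = X - X.mean(axis=0)
    cov = Xc.T @ Xc / (len(Xc) - 1)
    eigval, eigvec = np.linalg.eigh(cov)        # eigenvalues in ascending order
    order = np.argsort(eigval)[::-1][:dim]      # keep the largest `dim`
    return Xc @ eigvec[:, order] / np.sqrt(eigval[order] + 1e-8)

def nearest_codeword(X, centers):
    """Index of the closest codeword for every descriptor."""
    d = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return d.argmin(axis=1)

def kmeans(X, k, iters=20, seed=0):
    """Step (iii): plain Lloyd's k-means codebook (illustrative choice)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(iters):
        assign = nearest_codeword(X, centers)
        for j in range(k):
            if np.any(assign == j):             # leave empty clusters unchanged
                centers[j] = X[assign == j].mean(axis=0)
    return centers

def vlad_encode(X, centers):
    """Steps (iv)-(v): VLAD residual encoding with sum pooling,
    intra-normalization per codeword block, then global L2 normalization."""
    k, dim = centers.shape
    assign = nearest_codeword(X, centers)
    v = np.zeros((k, dim))
    for j in range(k):
        members = X[assign == j]
        if len(members):
            v[j] = (members - centers[j]).sum(axis=0)
    # Intra-normalization: each codeword block gets unit L2 norm.
    v /= np.maximum(np.linalg.norm(v, axis=1, keepdims=True), 1e-12)
    v = v.ravel()
    return v / max(np.linalg.norm(v), 1e-12)

# Step (i) stand-in: random vectors in place of real local video descriptors.
rng = np.random.default_rng(0)
descriptors = rng.normal(size=(500, 32))

whitened = pca_whiten(descriptors, dim=16)
codebook = kmeans(whitened, k=8)
video_vector = vlad_encode(whitened, codebook)  # one global vector per video
```

With 8 codewords and 16 whitened dimensions, `video_vector` has length 8 × 16 = 128. A real system would replace the random descriptors with features extracted from video and would learn the PCA projection and the codebook on a training set.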
ISSN:	1077-3142 1090-235X
DOI:	10.1016/j.cviu.2016.03.013