Jointly Learning Visual Poses and Pose Lexicon for Semantic Action Recognition

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 30, No. 2, pp. 457-467
Main Authors: Zhou, Lijuan; Li, Wanqing; Ogunbona, Philip; Zhang, Zhengyou
Format: Journal Article
Language: English
Published: New York: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.02.2020

Summary: A novel method for semantic action recognition through learning a pose lexicon is presented in this paper. A pose lexicon comprises a set of semantic poses, a set of visual poses, and a probabilistic mapping between the visual and semantic poses. This paper assumes that both the visual poses and the mapping are hidden, and proposes a method to simultaneously learn a visual pose model, which estimates the likelihood of an observed video frame being generated from hidden visual poses, and a pose lexicon model, which establishes the probabilistic mapping between the hidden visual poses and the semantic poses parsed from textual instructions. Specifically, the proposed method consists of a two-level hidden Markov model. One level represents the alignment between the visual poses and the semantic poses. The other level represents the visual pose sequence, with each visual pose modeled as a Gaussian mixture. An expectation-maximization algorithm is developed to train the pose lexicon. With the learned lexicon, action classification is formulated as finding the maximum posterior probability of a given sequence of video frames following a given sequence of semantic poses, constrained by the most likely visual pose and alignment sequences. The proposed method was evaluated on the MSRC-12, WorkoutSU-10, WorkoutUOW-18, Combined-15, Combined-17, and Combined-50 action datasets using cross-subject, cross-dataset, zero-shot, and seen/unseen protocols.
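As a rough illustration of the classification step described in the summary, the following Python sketch scores a frame sequence against a semantic pose sequence by maximizing jointly over the hidden visual poses and the frame-to-semantic-pose alignment. It is not the authors' implementation: the single-Gaussian emissions (standing in for the paper's Gaussian mixtures), the fixed self-loop alignment probability, and all names (PoseLexicon, score_action, etc.) are assumptions made for this example.

    # Minimal sketch of lexicon-based action classification, assuming
    # single-Gaussian emissions and a left-to-right alignment; NOT the
    # authors' implementation.
    import numpy as np
    from scipy.stats import multivariate_normal

    class PoseLexicon:
        """Toy pose lexicon: each semantic pose maps probabilistically to
        hidden visual poses, and each visual pose emits frames via a Gaussian
        (a single-component stand-in for the paper's Gaussian mixtures)."""

        def __init__(self, mapping, means, covs, self_loop=0.6):
            self.mapping = mapping      # P(visual pose | semantic pose), shape (S, V)
            self.means = means          # emission means per visual pose, shape (V, D)
            self.covs = covs            # emission covariances per visual pose, (V, D, D)
            self.self_loop = self_loop  # P(stay in the current semantic pose)

        def frame_loglik(self, frames):
            """Log-likelihood of each frame under each visual pose, shape (T, V)."""
            return np.stack(
                [multivariate_normal.logpdf(frames, mean=m, cov=c)
                 for m, c in zip(self.means, self.covs)],
                axis=1,
            )

        def score_action(self, frames, semantic_seq):
            """Viterbi-style max log-posterior of `frames` following
            `semantic_seq`, maximizing over the frame-to-semantic-pose
            alignment and, per frame, over the most likely visual pose."""
            T, K = len(frames), len(semantic_seq)
            emis = self.frame_loglik(frames)  # (T, V)
            # Best visual pose per (frame, semantic pose):
            # max over v of [ log P(v | s) + log P(frame | v) ].
            with np.errstate(divide="ignore"):
                best = (np.log(self.mapping[semantic_seq])[None, :, :]
                        + emis[:, None, :]).max(axis=2)  # (T, K)
            # Monotonic left-to-right alignment over the semantic pose sequence.
            stay, move = np.log(self.self_loop), np.log(1.0 - self.self_loop)
            delta = np.full((T, K), -np.inf)
            delta[0, 0] = best[0, 0]
            for t in range(1, T):
                delta[t, 0] = delta[t - 1, 0] + stay + best[t, 0]
                delta[t, 1:] = np.maximum(delta[t - 1, 1:] + stay,
                                          delta[t - 1, :-1] + move) + best[t, 1:]
            return delta[T - 1, K - 1]  # must finish in the last semantic pose

A clip would then be classified by scoring it against the semantic pose sequence of every known action and taking the argmax; the toy numbers below are fabricated purely for illustration:

    rng = np.random.default_rng(0)
    lex = PoseLexicon(
        mapping=np.array([[0.9, 0.1], [0.1, 0.9]]),  # 2 semantic x 2 visual poses
        means=np.array([[0.0, 0.0], [3.0, 3.0]]),
        covs=np.array([np.eye(2), np.eye(2)]),
    )
    frames = np.vstack([rng.normal([0.0, 0.0], 1.0, (5, 2)),
                        rng.normal([3.0, 3.0], 1.0, (5, 2))])
    actions = {"raise-arm": [0, 1], "lower-arm": [1, 0]}
    print(max(actions, key=lambda a: lex.score_action(frames, actions[a])))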
ISSN: 1051-8215
EISSN: 1558-2205
DOI: 10.1109/TCSVT.2019.2890829