Affect recognition from facial movements and body gestures by hierarchical deep spatio-temporal features and fusion strategy

Bibliographic Details
Published in	Neural networks Vol. 105; pp. 36 - 51
Main Authors	Sun, Bo, Cao, Siming, He, Jun, Yu, Lejun
Format	Journal Article
Language	English
Published	United States Elsevier Ltd 01.09.2018
Subjects	Affect recognition; Bilateral long short-term memory recurrent neural network; Convolutional neural network; Deep learning; Deep spatio-temporal hierarchical feature; Multi-modal feature fusion strategy
Online Access	Get full text
ISSN	0893-6080, 1879-2782
DOI	10.1016/j.neunet.2017.11.021

Summary:	Affect presentation is periodic and multi-modal, expressed through facial movements, body gestures, and so on. Studies have shown that temporal selection and multi-modal combinations may benefit affect recognition. In this article, we therefore propose a spatio-temporal fusion model that extracts spatio-temporal hierarchical features based on selected expressive components. In addition, a multi-modal hierarchical fusion strategy is presented. Our model learns the spatio-temporal hierarchical features from videos with a proposed deep network that combines a convolutional neural network (CNN) and a bilateral long short-term memory recurrent neural network (BLSTM-RNN) with principal component analysis (PCA). Our approach handles each video as a “video sentence.” It first obtains a skeleton with the temporal selection process and then segments key words with a sliding window. Finally, it obtains the features with a deep network composed of a video-skeleton and video-words. Our model combines feature-level and decision-level fusion to fuse the multi-modal information. Experimental results showed that our model improved the multi-modal affect recognition accuracy from 95.13% in the existing literature to 99.57% on the face and body (FABO) database, an improvement of 4.44 percentage points, and obtained a macro average accuracy (MAA) of up to 99.71%.
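Two of the steps named in the summary, sliding-window segmentation into "video words" and decision-level fusion of the face and body modalities, can be illustrated with a minimal Python sketch. The summary does not give the window size, stride, or fusion rule, so the function names `sliding_windows` and `fuse_decisions`, the parameters `size`, `stride`, and `face_weight`, and the weighted-average rule are all assumptions made for illustration, not the authors' method. Temporal selection, the CNN, the BLSTM-RNN, and PCA are omitted.

```python
from typing import List, Sequence


def sliding_windows(frames: Sequence, size: int, stride: int) -> List[Sequence]:
    """Split a frame sequence into fixed-length, possibly overlapping segments.

    `size` and `stride` are hypothetical parameters; the summary does not
    state the values used in the article.
    """
    if size <= 0 or stride <= 0:
        raise ValueError("size and stride must be positive")
    # Only full-length windows are kept; a trailing partial window is dropped.
    return [frames[i:i + size] for i in range(0, len(frames) - size + 1, stride)]


def fuse_decisions(face_probs: Sequence[float],
                   body_probs: Sequence[float],
                   face_weight: float = 0.5) -> int:
    """Decision-level fusion by weighted averaging of per-class scores.

    Weighted averaging is an assumed rule chosen for illustration.
    Returns the index of the class with the highest fused score.
    """
    if len(face_probs) != len(body_probs):
        raise ValueError("both modalities must score the same classes")
    fused = [face_weight * f + (1.0 - face_weight) * b
             for f, b in zip(face_probs, body_probs)]
    return max(range(len(fused)), key=fused.__getitem__)


# Ten stand-in frame indices cut into windows of 4 frames with stride 2.
windows = sliding_windows(list(range(10)), size=4, stride=2)

# Made-up per-class scores from a face model and a body model.
label = fuse_decisions([0.2, 0.7, 0.1], [0.5, 0.3, 0.2])
```

In this sketch the overlap between windows is controlled by `stride`, and `face_weight` sets how much the face modality counts relative to the body modality; both would have to be tuned or replaced to match the article.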