Learning Salient Features for Speech Emotion Recognition Using Convolutional Neural Networks

Bibliographic Details
Published in	IEEE transactions on multimedia Vol. 16; no. 8; pp. 2203 - 2213
Main Authors	Mao, Qirong, Dong, Ming, Huang, Zhengwei, Zhan, Yongzhao
Format	Journal Article
Language	English
Published	Piscataway IEEE 01.12.2014 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Acoustics; Affective-salient discriminative feature analysis; Convolution; convolutional neural networks; Distortion; Emotion recognition; Emotions; Feature extraction; feature learning; Feature recognition; Invariants; Multimedia; Neural networks; Spectrogram; Speech; speech emotion recognition; Speech recognition

More Information
Summary:	As an essential means of understanding human emotional behavior, speech emotion recognition (SER) has attracted a great deal of attention in human-centered signal processing. Accuracy in SER depends heavily on finding good affect-related, discriminative features. In this paper, we propose to learn affect-salient features for SER using convolutional neural networks (CNN). The training of the CNN involves two stages. In the first stage, unlabeled samples are used to learn local invariant features (LIF) using a variant of the sparse auto-encoder (SAE) with reconstruction penalization. In the second stage, the LIF are used as the input to a feature extractor, salient discriminative feature analysis (SDFA), which learns affect-salient, discriminative features using a novel objective function that encourages feature saliency, orthogonality, and discrimination for SER. Our experimental results on benchmark datasets show that our approach leads to stable and robust recognition performance in complex scenes (e.g., with speaker and language variation, and environment distortion) and outperforms several well-established SER features.
ISSN:	1520-9210 1941-0077
DOI:	10.1109/TMM.2014.2360798