Evaluating deep learning architectures for Speech Emotion Recognition

Bibliographic Details
Published in	Neural networks Vol. 92; pp. 60-68
Main Authors	Fayek, Haytham M.; Lech, Margaret; Cavedon, Lawrence
Format	Journal Article
Language	English
Published	United States: Elsevier Ltd, 01.08.2017
Subjects	Affective computing; Deep learning; Emotion recognition; Emotions; Machine Learning; Neural networks; Neural Networks (Computer); Speech recognition; Speech Recognition Software
Summary:	Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models’ performances.
ISSN:	0893-6080, 1879-2782
DOI:	10.1016/j.neunet.2017.02.013