Evaluating deep learning architectures for Speech Emotion Recognition
Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-...
Saved in:
Published in | Neural networks Vol. 92; pp. 60 - 68 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published |
United States
Elsevier Ltd
01.08.2017
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Speech Emotion Recognition (SER) can be regarded as a static or dynamic classification problem, which makes SER an excellent test bed for investigating and comparing various deep learning architectures. We describe a frame-based formulation to SER that relies on minimal speech processing and end-to-end deep learning to model intra-utterance dynamics. We use the proposed SER system to empirically explore feed-forward and recurrent neural network architectures and their variants. Experiments conducted illuminate the advantages and limitations of these architectures in paralinguistic speech recognition and emotion recognition in particular. As a result of our exploration, we report state-of-the-art results on the IEMOCAP database for speaker-independent SER and present quantitative and qualitative assessments of the models’ performances. |
---|---|
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
ISSN: | 0893-6080 1879-2782 |
DOI: | 10.1016/j.neunet.2017.02.013 |