Segment-based speech emotion recognition using recurrent neural networks
| Published in | 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 190-195 |
|---|---|
| Main Authors | , |
| Format | Conference Proceeding |
| Language | English |
| Published | IEEE, 01.10.2017 |
| Summary | Recently, Recurrent Neural Networks (RNNs) have produced state-of-the-art results for Speech Emotion Recognition (SER). The choice of the appropriate time-scale for Low Level Descriptors (LLDs) (local features) and statistical functionals (global features) is key for a high-performing SER system. In this paper, we investigate both local and global features and evaluate the performance at various time-scales (frame, phoneme, word, or utterance). We show that for RNN models, extracting statistical functionals over speech segments that roughly correspond to the duration of a couple of words produces optimal accuracy. We report state-of-the-art SER performance on the IEMOCAP corpus at significantly lower model and computational complexity. |
|---|---|
| ISSN | 2156-8111 |
| DOI | 10.1109/ACII.2017.8273599 |
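The summary's core idea, pooling frame-level LLDs into statistical functionals over segments roughly a couple of words long before feeding an RNN, can be sketched as follows. This is a hypothetical illustration, not the paper's actual pipeline: the function name, frame rate, segment duration, and choice of mean/std functionals are all assumptions.

```python
import numpy as np

def segment_functionals(llds, frame_rate=100, segment_sec=1.0):
    """Pool frame-level LLDs into segment-level statistical functionals.

    llds:        (num_frames, num_features) array of low-level descriptors
    frame_rate:  frames per second (assumed 100, i.e. 10 ms hop)
    segment_sec: segment duration; ~1 s roughly spans a couple of words

    Returns (num_segments, 2 * num_features): per-segment mean and std,
    giving the RNN a much shorter input sequence than raw frames.
    """
    seg_len = int(frame_rate * segment_sec)
    num_segments = max(1, len(llds) // seg_len)
    out = []
    for i in range(num_segments):
        seg = llds[i * seg_len:(i + 1) * seg_len]
        # Concatenate mean and standard deviation functionals per segment
        out.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
    return np.stack(out)

# Example: 3 seconds of 40-dim LLDs at 100 frames/s -> 3 segments of 80 dims
feats = segment_functionals(np.random.randn(300, 40))
```

The segment length is the tunable time-scale the paper studies; shorter segments behave like local (frame-level) features, while a single utterance-length segment reduces to global functionals.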