Segment-based speech emotion recognition using recurrent neural networks


Bibliographic Details
Published in: 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), pp. 190 - 195
Main Authors: Tzinis, Efthymios; Potamianos, Alexandros
Format: Conference Proceeding
Language: English
Published: IEEE, 01.10.2017

Summary: Recently, Recurrent Neural Networks (RNNs) have produced state-of-the-art results for Speech Emotion Recognition (SER). The choice of the appropriate time-scale for Low Level Descriptors (LLDs) (local features) and statistical functionals (global features) is key for a high performing SER system. In this paper, we investigate both local and global features and evaluate the performance at various time-scales (frame, phoneme, word or utterance). We show that for RNN models, extracting statistical functionals over speech segments that roughly correspond to the duration of a couple of words produces optimal accuracy. We report state-of-the-art SER performance on the IEMOCAP corpus at a significantly lower model and computational complexity.
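The core idea of the abstract, pooling frame-level LLDs into segment-level statistical functionals before feeding them to a recurrent model, can be sketched as below. This is a minimal NumPy illustration under stated assumptions: the function name, the choice of mean and standard deviation as functionals, and the fixed-frame segment length are illustrative, not the paper's exact feature set or segmentation.

```python
import numpy as np

def segment_functionals(lld_frames, seg_len, hop=None):
    """Pool frame-level LLDs into segment-level statistical functionals.

    lld_frames: (T, D) array, one row of low-level descriptors per frame.
    seg_len:    frames per segment (e.g. roughly a couple of words of speech).
    hop:        frames between segment starts; defaults to non-overlapping.

    Returns an (S, 2*D) array: per-segment mean and standard deviation of
    each descriptor, concatenated. The resulting sequence of segment
    vectors is what an RNN would consume, one vector per time step.
    """
    hop = hop or seg_len
    T, D = lld_frames.shape
    feats = []
    for start in range(0, T - seg_len + 1, hop):
        seg = lld_frames[start:start + seg_len]
        feats.append(np.concatenate([seg.mean(axis=0), seg.std(axis=0)]))
    return np.stack(feats)
```

With 100 frames of 3 descriptors and a 20-frame segment, this yields 5 segment vectors of dimension 6, shrinking the sequence length the RNN must process by a factor of 20.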
ISSN: 2156-8111
DOI: 10.1109/ACII.2017.8273599