Speech Emotion Recognition Using Deep Learning Hybrid Models


Bibliographic Details
Published in: 2022 International Conference on Emerging Technologies in Electronics, Computing and Communication (ICETECC), pp. 1 - 5
Main Authors: Bhanbhro, Jamsher; Talpur, Shahnawaz; Memon, Asif Aziz
Format: Conference Proceeding
Language: English
Published: IEEE, 07.12.2022

Summary: Speech Emotion Recognition (SER) has been essential to Human-Computer Interaction (HCI) and other complex speech-processing systems over the past decade. Because emotional expression varies widely between speakers, SER is a complex and challenging task, and the features extracted from speech signals are crucial to an SER system's performance; developing efficient feature-extraction and classification models remains difficult. This study proposes hybrid deep learning models for accurately extracting salient features and producing predictions with higher confidence. Speech samples are first preprocessed using data-augmentation and dataset-balancing techniques. The temporal features of the Mel spectrogram are then learned by a stack of Convolutional Neural Networks (CNN) combined with a Long Short-Term Memory (LSTM) network. The study uses the RAVDESS dataset, which contains 1440 audio samples in a North American English accent. The CNN's strength in capturing spatial features, combined with sequence encoding, yields accuracy above 93.9% on this dataset when classifying emotions into one of eight categories. The model is generalized using Additive White Gaussian Noise (AWGN) and dropout techniques.
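The AWGN augmentation mentioned in the summary can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the abstract does not specify the noise level used, so the target signal-to-noise ratio (`snr_db`) and the function name `add_awgn` are assumptions for the example.

```python
import numpy as np

def add_awgn(signal, snr_db, rng=None):
    """Add white Gaussian noise to a 1-D audio signal at a target SNR (dB).

    Noise power is scaled relative to the measured signal power so the
    result has approximately the requested signal-to-noise ratio.
    """
    rng = np.random.default_rng(rng)
    signal = np.asarray(signal, dtype=float)
    sig_power = np.mean(signal ** 2)
    # Convert the SNR from dB to a linear power ratio.
    noise_power = sig_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=signal.shape)
    return signal + noise

# Example: augment a synthetic tone at 20 dB SNR before feature extraction.
clean = np.sin(np.linspace(0, 100, 16000))
noisy = add_awgn(clean, snr_db=20, rng=0)
```

In a pipeline like the one described, such noisy copies of the training audio would be added to the dataset before computing Mel spectrograms, exposing the CNN-LSTM model to perturbed inputs and improving generalization.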
DOI: 10.1109/ICETECC56662.2022.10069212