Speech Emotion Recognition Using Deep Learning

Bibliographic Details
Published in: 2024 XXVII International Conference on Soft Computing and Measurements (SCM), pp. 380-384
Main Authors: Gismelbari, Mohamed A., Vixnin, Ilya I., Kovalev, Gregory M., Gogolev, Eugane E.
Format: Conference Proceeding
Language: English
Published: IEEE, 22.05.2024
More Information
Summary: This study explores the application of deep learning techniques in recognizing emotional states from spoken language. Specifically, we employ Convolutional Neural Networks (CNNs) and the HuBERT model to analyze the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Our findings suggest that deep learning models, particularly the HuBERT model, exhibit significant potential in accurately identifying speech emotions. The models were trained and tested on a dataset containing various emotional expressions, including happiness, sadness, anger, and fear, among others. The experimentation involved preprocessing the audio data, extracting features using Mel Frequency Cepstral Coefficients (MFCCs), and implementing deep learning architectures for emotion classification. The HuBERT model, with its advanced self-supervised learning mechanism, outperformed traditional CNNs in terms of accuracy and efficiency. This research highlights the importance of selecting appropriate deep learning models and feature sets for the task of speech emotion recognition (SER). Our analysis demonstrates that the HuBERT model, by leveraging contextual information and temporal dynamics in speech, offers a promising approach for developing more sensitive and accurate SER systems. These systems have potential applications in various fields, including mental health assessment, interactive voice response systems, and educational software, by enabling machines to understand and respond to human emotions more effectively. The findings of this study contribute to the ongoing discussion in the field of artificial intelligence about the best practices for implementing deep learning techniques in speech processing tasks.
DOI: 10.1109/SCM62608.2024.10554077
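
The abstract outlines a two-branch pipeline: MFCC features feeding a CNN baseline, and self-supervised HuBERT embeddings for the stronger classifier. The sketch below is a minimal, hypothetical illustration of that pipeline using librosa and Hugging Face Transformers; the checkpoint name (facebook/hubert-base-ls960), the 16 kHz sampling rate, the mean-pooling step, and the linear emotion head are assumptions for illustration, not details taken from the paper.

# Minimal sketch (not the authors' code) of the pipeline described in the abstract:
# MFCC extraction for a CNN baseline and utterance-level HuBERT embeddings
# for emotion classification on RAVDESS.
import librosa
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

SAMPLE_RATE = 16000  # HuBERT base checkpoints expect 16 kHz mono audio

def extract_mfcc(path, n_mfcc=40):
    # Load a RAVDESS clip and compute MFCCs; a CNN would consume this 2-D feature map.
    waveform, sr = librosa.load(path, sr=SAMPLE_RATE)
    return librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)

def extract_hubert_embedding(path, model_name="facebook/hubert-base-ls960"):
    # Mean-pool HuBERT frame embeddings into a single utterance-level vector.
    waveform, _ = librosa.load(path, sr=SAMPLE_RATE)
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
    model = HubertModel.from_pretrained(model_name)
    inputs = feature_extractor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_frames, 768)
    return hidden.mean(dim=1).squeeze(0)  # (768,)

# A simple linear head over the pooled embedding predicts one of the eight
# RAVDESS emotion classes (neutral, calm, happy, sad, angry, fearful, disgust, surprised).
emotion_head = torch.nn.Linear(768, 8)

In practice the HuBERT encoder would be fine-tuned, or at least the classification head trained, on the RAVDESS labels; the abstract reports only that this model family outperformed the CNN-on-MFCC baseline, not the exact training configuration.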