Speech Emotion Recognition Using Deep Learning

Bibliographic Details
Published in: 2024 XXVII International Conference on Soft Computing and Measurements (SCM), pp. 380-384
Main Authors: Gismelbari, Mohamed A., Vixnin, Ilya I., Kovalev, Gregory M., Gogolev, Eugane E.
Format: Conference Proceeding
Language: English
Published: IEEE, 22.05.2024
More Information
Summary: This study explores the application of deep learning techniques in recognizing emotional states from spoken language. Specifically, we employ Convolutional Neural Networks (CNNs) and the HuBERT model to analyze the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Our findings suggest that deep learning models, particularly the HuBERT model, exhibit significant potential in accurately identifying speech emotions. The models were trained and tested on a dataset containing various emotional expressions, including happiness, sadness, anger, and fear, among others. The experimentation involved preprocessing the audio data, extracting features using Mel Frequency Cepstral Coefficients (MFCCs), and implementing deep learning architectures for emotion classification. The HuBERT model, with its advanced self-supervised learning mechanism, outperformed traditional CNNs in terms of accuracy and efficiency. This research highlights the importance of selecting appropriate deep learning models and feature sets for the task of speech emotion recognition (SER). Our analysis demonstrates that the HuBERT model, by leveraging contextual information and temporal dynamics in speech, offers a promising approach for developing more sensitive and accurate SER systems. These systems have potential applications in various fields, including mental health assessment, interactive voice response systems, and educational software, by enabling machines to understand and respond to human emotions more effectively. The findings of this study contribute to the ongoing discussion in the field of artificial intelligence about the best practices for implementing deep learning techniques in speech processing tasks.
DOI: 10.1109/SCM62608.2024.10554077
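
The abstract outlines a two-branch pipeline: MFCC features feeding a CNN baseline, and self-supervised HuBERT embeddings for the stronger classifier. The sketch below is a minimal, hypothetical illustration of that pipeline using librosa and Hugging Face Transformers; the checkpoint name (facebook/hubert-base-ls960), the 16 kHz sampling rate, the mean-pooling step, and the linear emotion head are assumptions for illustration, not details taken from the paper.

# Minimal sketch (not the authors' code) of the pipeline described in the abstract:
# MFCC extraction for a CNN baseline and utterance-level HuBERT embeddings
# for emotion classification on RAVDESS.
import librosa
import torch
from transformers import HubertModel, Wav2Vec2FeatureExtractor

SAMPLE_RATE = 16000  # HuBERT base checkpoints expect 16 kHz mono audio

def extract_mfcc(path, n_mfcc=40):
    # Load a RAVDESS clip and compute MFCCs; a CNN would consume this 2-D feature map.
    waveform, sr = librosa.load(path, sr=SAMPLE_RATE)
    return librosa.feature.mfcc(y=waveform, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)

def extract_hubert_embedding(path, model_name="facebook/hubert-base-ls960"):
    # Mean-pool HuBERT frame embeddings into a single utterance-level vector.
    waveform, _ = librosa.load(path, sr=SAMPLE_RATE)
    feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(model_name)
    model = HubertModel.from_pretrained(model_name)
    inputs = feature_extractor(waveform, sampling_rate=SAMPLE_RATE, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state  # (1, n_frames, 768)
    return hidden.mean(dim=1).squeeze(0)  # (768,)

# A simple linear head over the pooled embedding predicts one of the eight
# RAVDESS emotion classes (neutral, calm, happy, sad, angry, fearful, disgust, surprised).
emotion_head = torch.nn.Linear(768, 8)

In practice the HuBERT encoder would be fine-tuned, or at least the classification head trained, on the RAVDESS labels; the abstract reports only that this model family outperformed the CNN-on-MFCC baseline, not the exact training configuration.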