Speech Emotion Recognition Using Deep Learning
Published in | 2024 XXVII International Conference on Soft Computing and Measurements (SCM) pp. 380 - 384 |
---|---|
Main Authors | , , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 22.05.2024 |
Summary: | This study applies deep learning to recognizing emotional states from speech. Specifically, we use Convolutional Neural Networks (CNNs) and the HuBERT model to analyze the Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS). Our findings suggest that deep learning models, particularly HuBERT, show significant potential for accurately identifying speech emotions. The models were trained and tested on a dataset containing various emotional expressions, including happiness, sadness, anger, and fear, among others. The experiments involved preprocessing the audio data, extracting features using Mel Frequency Cepstral Coefficients (MFCCs), and implementing deep learning architectures for emotion classification. The HuBERT model, with its self-supervised learning mechanism, outperformed traditional CNNs in accuracy and efficiency. This research highlights the importance of selecting appropriate deep learning models and feature sets for speech emotion recognition (SER). Our analysis indicates that HuBERT, by leveraging contextual information and temporal dynamics in speech, offers a promising approach for developing more sensitive and accurate SER systems. Such systems have potential applications in fields including mental health assessment, interactive voice response systems, and educational software, where they would enable machines to understand and respond to human emotions more effectively. The findings contribute to the ongoing discussion in artificial intelligence about best practices for applying deep learning to speech processing tasks. |
---|---|
DOI: | 10.1109/SCM62608.2024.10554077 |
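
The summary mentions preprocessing RAVDESS audio for emotion classification but does not say how labels were obtained. As a minimal sketch, assuming the publicly documented RAVDESS file-naming convention (seven hyphen-separated two-digit fields, with the third field encoding the emotion), a label can be read directly from each file name. The helper name `emotion_from_filename` and the label strings are illustrative choices, not taken from the paper.

```python
from pathlib import PurePosixPath

# Emotion codes from the public RAVDESS naming convention.
# Assumption: the paper's own label handling is not described in the summary.
RAVDESS_EMOTIONS = {
    "01": "neutral", "02": "calm", "03": "happy", "04": "sad",
    "05": "angry", "06": "fearful", "07": "disgust", "08": "surprised",
}


def emotion_from_filename(path: str) -> str:
    """Return the emotion label encoded in a RAVDESS file name (hypothetical helper)."""
    # Drop directories and the extension, e.g. "03-01-06-01-02-01-12".
    stem = PurePosixPath(path).stem
    fields = stem.split("-")
    if len(fields) != 7:
        raise ValueError(f"not a RAVDESS-style file name: {path!r}")
    code = fields[2]  # third field is the emotion code
    if code not in RAVDESS_EMOTIONS:
        raise ValueError(f"unknown emotion code {code!r} in {path!r}")
    return RAVDESS_EMOTIONS[code]
```

A call such as `emotion_from_filename("Actor_12/03-01-06-01-02-01-12.wav")` yields a string label that can be paired with whatever features (MFCCs or HuBERT embeddings) are extracted from the same file.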