A Lightweight Multi-Scale Model for Speech Emotion Recognition

Bibliographic Details
Published in: IEEE Access, Vol. 12, pp. 130228-130240
Main Authors: Li, Haoming; Zhao, Daqi; Wang, Jingwen; Wang, Deqiang
Format: Journal Article
Language: English
Published: IEEE, 2024

Summary: Recognizing emotional states from speech is essential for human-computer interaction, yet it is challenging to realize effective speech emotion recognition (SER) on platforms with limited memory capacity and computing power. In this paper, we propose a lightweight multi-scale deep neural network architecture for SER, which takes Mel Frequency Cepstral Coefficients (MFCCs) as input. To realize effective multi-scale feature extraction, we propose a new Inception module, named A_Inception. A_Inception combines the merits of the Inception module and attention-based rectified linear units (AReLU), and thus can learn multi-scale features adaptively at low computational cost. Meanwhile, to extract the most important emotional information, we propose a new multi-scale cepstral attention and temporal-cepstral attention (MCA-TCA) module, which focuses on the key cepstral components and the key temporal-cepstral positions. Furthermore, a loss function combining Softmax loss and Center loss is adopted to supervise model training and thus enhance the model's discriminative power. Experiments on the IEMOCAP, EMO-DB, and SAVEE datasets verify the performance of the proposed model and compare it with state-of-the-art SER models. Numerical results show that the proposed model has a small number of parameters (0.82 M) and a much lower computational cost (81.64 MFLOPs) than the compared models, while achieving impressive accuracy on all datasets considered.
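
The summary states that training is supervised by a loss combining Softmax loss and Center loss. As a rough illustration only (not the authors' code), the sketch below shows one standard way to combine cross-entropy with Center loss (Wen et al., 2016) in PyTorch; the class count, embedding dimension, and weighting factor lam are assumed values, not taken from the paper.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CenterLoss(nn.Module):
        """Center loss: pulls each embedding toward a learnable center of its class.
        Hypothetical re-implementation; the paper's exact settings are not given here."""
        def __init__(self, num_classes: int, feat_dim: int):
            super().__init__()
            self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

        def forward(self, features: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
            # Squared distance between each sample's embedding and its class center.
            centers_batch = self.centers[labels]              # (B, feat_dim)
            return 0.5 * (features - centers_batch).pow(2).sum(dim=1).mean()

    class SoftmaxCenterLoss(nn.Module):
        """Joint objective L = L_softmax + lam * L_center used to supervise training."""
        def __init__(self, num_classes: int, feat_dim: int, lam: float = 0.01):
            super().__init__()
            self.center_loss = CenterLoss(num_classes, feat_dim)
            self.lam = lam

        def forward(self, logits, features, labels):
            return F.cross_entropy(logits, labels) + self.lam * self.center_loss(features, labels)

    # Usage example with 4 emotion classes and a 128-d embedding (both assumed values).
    criterion = SoftmaxCenterLoss(num_classes=4, feat_dim=128, lam=0.01)
    logits, feats = torch.randn(8, 4), torch.randn(8, 128)
    labels = torch.randint(0, 4, (8,))
    loss = criterion(logits, feats, labels)

In this formulation the Softmax (cross-entropy) term keeps classes separable while the Center loss term compacts each class in the embedding space, which is the discriminative effect the abstract attributes to the combined loss.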
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3432813