A Lightweight Multi-Scale Model for Speech Emotion Recognition
Published in | IEEE Access, Vol. 12, pp. 130228-130240 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | IEEE, 2024 |
Summary: Recognizing emotional states from speech is essential for human-computer interaction. However, it is challenging to realize effective speech emotion recognition (SER) on platforms with limited memory capacity and computing power. In this paper, we propose a lightweight multi-scale deep neural network architecture for SER, which takes Mel Frequency Cepstral Coefficients (MFCCs) as input. To realize effective multi-scale feature extraction, we propose a new Inception module, named A_Inception. A_Inception combines the merits of the Inception module and attention-based rectified linear units (AReLU), and can thus learn multi-scale features adaptively at low computational cost. Meanwhile, to extract the most important emotional information, we propose a new multi-scale cepstral attention and temporal-cepstral attention (MCA-TCA) module, whose idea is to focus on the key cepstral components and the key temporal-cepstral positions. Furthermore, a loss function combining Softmax loss and Center loss is adopted to supervise model training and enhance the model's discriminative power. Experiments have been carried out on the IEMOCAP, EMO-DB, and SAVEE datasets to verify the performance of the proposed model and to compare it with state-of-the-art SER models. Numerical results reveal that the proposed model has a small number of parameters (0.82 M) and a much lower computational cost (81.64 MFLOPs) than the compared models, while achieving impressive accuracy on all datasets considered.
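The record gives no implementation details, but the abstract's idea of Inception-style multi-scale feature extraction over MFCC input can be illustrated with a minimal PyTorch sketch. The branch layout, kernel sizes, channel widths, and the plain ReLU activation (standing in for the paper's AReLU) are illustrative assumptions, not the authors' A_Inception design.

```python
# Minimal sketch of Inception-style multi-scale feature extraction over an
# MFCC "image" of shape (batch, 1, n_mfcc, n_frames). Branch layout, kernel
# sizes, and channel widths are assumptions, not the paper's A_Inception
# module (which additionally uses AReLU activations instead of plain ReLU).
import torch
import torch.nn as nn


class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch: int = 1, branch_ch: int = 16):
        super().__init__()
        # Parallel branches with different receptive fields.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the multi-scale responses along the channel axis.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)


if __name__ == "__main__":
    mfcc = torch.randn(8, 1, 40, 300)      # e.g. 40 MFCCs x 300 frames
    print(MultiScaleBlock()(mfcc).shape)   # torch.Size([8, 48, 40, 300])
```

Because each branch preserves the spatial size, the concatenated output keeps the cepstral-by-temporal layout, which is what downstream attention over cepstral and temporal-cepstral positions (as the MCA-TCA module is described) would operate on.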
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3432813
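The abstract also states that training is supervised by a loss combining Softmax loss and Center loss. Below is a minimal sketch of such a combined objective, following the standard Center loss formulation; the embedding dimension, number of emotion classes, and the weighting factor `lam` are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of a Softmax (cross-entropy) + Center loss objective.
# Embedding size, class count, and the weight `lam` are illustrative only.
import torch
import torch.nn as nn


class CenterLoss(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # One learnable center per emotion class in the embedding space.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # L_center = 1/2 * mean_i ||x_i - c_{y_i}||^2
        diffs = feats - self.centers[labels]
        return 0.5 * (diffs ** 2).sum(dim=1).mean()


def total_loss(logits, feats, labels, center_loss, lam: float = 0.01):
    # L = L_softmax + lambda * L_center
    return nn.functional.cross_entropy(logits, labels) + lam * center_loss(feats, labels)


if __name__ == "__main__":
    center_loss = CenterLoss(num_classes=4, feat_dim=128)
    feats = torch.randn(8, 128)            # embeddings from the backbone
    logits = torch.randn(8, 4)             # classifier outputs
    labels = torch.randint(0, 4, (8,))
    print(total_loss(logits, feats, labels, center_loss))
```

The Center loss term pulls embeddings of the same emotion class toward a shared center, which is the usual way such a combined loss increases discriminative power beyond cross-entropy alone.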