A Lightweight Multi-Scale Model for Speech Emotion Recognition
Published in | IEEE Access, Vol. 12, pp. 130228-130240 |
---|---|
Main Authors | , , , |
Format | Journal Article |
Language | English |
Published | IEEE, 2024 |
Summary: Recognizing emotional states from speech is essential for human-computer interaction. However, it is challenging to realize effective speech emotion recognition (SER) on platforms with limited memory capacity and computing power. In this paper, we propose a lightweight multi-scale deep neural network architecture for SER, which takes Mel Frequency Cepstral Coefficients (MFCCs) as input. To realize effective multi-scale feature extraction, we propose a new Inception module, named A_Inception. A_Inception combines the merits of the Inception module and attention-based rectified linear units (AReLU), and can thus learn multi-scale features adaptively at low computational cost. Meanwhile, to extract the most important emotional information, we propose a new multi-scale cepstral attention and temporal-cepstral attention (MCA-TCA) module, whose idea is to focus on the key cepstral components and the key temporal-cepstral positions. Furthermore, a loss function combining Softmax loss and Center loss is adopted to supervise model training and enhance the model's discriminative power. Experiments have been carried out on the IEMOCAP, EMO-DB, and SAVEE datasets to verify the performance of the proposed model and to compare it with state-of-the-art SER models. Numerical results reveal that the proposed model has a small number of parameters (0.82 M) and a much lower computational cost (81.64 MFLOPs) than the compared models, while achieving impressive accuracy on all datasets considered.
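The record gives no implementation details, but the abstract's idea of Inception-style multi-scale feature extraction over MFCC input can be illustrated with a minimal PyTorch sketch. The branch layout, kernel sizes, channel widths, and the plain ReLU activation (standing in for the paper's AReLU) are illustrative assumptions, not the authors' A_Inception design.

```python
# Minimal sketch of Inception-style multi-scale feature extraction over an
# MFCC "image" of shape (batch, 1, n_mfcc, n_frames). Branch layout, kernel
# sizes, and channel widths are assumptions, not the paper's A_Inception
# module (which additionally uses AReLU activations instead of plain ReLU).
import torch
import torch.nn as nn


class MultiScaleBlock(nn.Module):
    def __init__(self, in_ch: int = 1, branch_ch: int = 16):
        super().__init__()
        # Parallel branches with different receptive fields.
        self.branch1 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=1),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
        )
        self.branch3 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
        )
        self.branch5 = nn.Sequential(
            nn.Conv2d(in_ch, branch_ch, kernel_size=5, padding=2),
            nn.BatchNorm2d(branch_ch), nn.ReLU(inplace=True),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Concatenate the multi-scale responses along the channel axis.
        return torch.cat([self.branch1(x), self.branch3(x), self.branch5(x)], dim=1)


if __name__ == "__main__":
    mfcc = torch.randn(8, 1, 40, 300)      # e.g. 40 MFCCs x 300 frames
    print(MultiScaleBlock()(mfcc).shape)   # torch.Size([8, 48, 40, 300])
```

Because each branch preserves the spatial size, the concatenated output keeps the cepstral-by-temporal layout, which is what downstream attention over cepstral and temporal-cepstral positions (as the MCA-TCA module is described) would operate on.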
ISSN: 2169-3536
DOI: 10.1109/ACCESS.2024.3432813
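The abstract also states that training is supervised by a loss combining Softmax loss and Center loss. Below is a minimal sketch of such a combined objective, following the standard Center loss formulation; the embedding dimension, number of emotion classes, and the weighting factor `lam` are assumptions for illustration, not values from the paper.

```python
# Hedged sketch of a Softmax (cross-entropy) + Center loss objective.
# Embedding size, class count, and the weight `lam` are illustrative only.
import torch
import torch.nn as nn


class CenterLoss(nn.Module):
    def __init__(self, num_classes: int, feat_dim: int):
        super().__init__()
        # One learnable center per emotion class in the embedding space.
        self.centers = nn.Parameter(torch.randn(num_classes, feat_dim))

    def forward(self, feats: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
        # L_center = 1/2 * mean_i ||x_i - c_{y_i}||^2
        diffs = feats - self.centers[labels]
        return 0.5 * (diffs ** 2).sum(dim=1).mean()


def total_loss(logits, feats, labels, center_loss, lam: float = 0.01):
    # L = L_softmax + lambda * L_center
    return nn.functional.cross_entropy(logits, labels) + lam * center_loss(feats, labels)


if __name__ == "__main__":
    center_loss = CenterLoss(num_classes=4, feat_dim=128)
    feats = torch.randn(8, 128)            # embeddings from the backbone
    logits = torch.randn(8, 4)             # classifier outputs
    labels = torch.randint(0, 4, (8,))
    print(total_loss(logits, feats, labels, center_loss))
```

The Center loss term pulls embeddings of the same emotion class toward a shared center, which is the usual way such a combined loss increases discriminative power beyond cross-entropy alone.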