Att-Net: Enhanced emotion recognition system using lightweight self-attention module
Published in | Applied soft computing Vol. 102; p. 107101 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | Elsevier B.V, 01.04.2021 |
Subjects | |
Summary: | Speech emotion recognition (SER) is an active research field in digital signal processing and plays a crucial role in numerous human–computer interaction (HCI) applications. Current baseline state-of-the-art systems suffer from relatively low accuracy and high computational cost, which must be improved to make them suitable for real-time industrial uses such as detecting content from speech data. The main reasons for the low recognition rate and high computational cost are the scarcity of datasets, model configuration, and pattern recognition, which remain the most challenging aspects of building a robust SER system. In this study, we address these problems and propose a simple and lightweight deep-learning-based self-attention module (SAM) for an SER system. The intermediate feature map is fed to SAM, which efficiently produces channel-axis and spatial-axis attention maps with negligible overhead. We use a multi-layer perceptron (MLP) in the channel attention branch to extract global cues and a special dilated convolutional neural network (CNN) in the spatial attention branch to extract spatial information from the input tensor. We then merge the spatial and channel attention maps to produce combined attention weights as a self-attention module. We place SAM between the convolutional and fully connected layers and train the model end-to-end. An ablation study and comprehensive experiments are conducted on the IEMOCAP, RAVDESS, and EMO-DB speech emotion datasets. The proposed SER system shows consistent improvements across all experiments on all datasets, achieving 78.01%, 80.00%, and 93.00% average recall, respectively.
•Proposed a simple and lightweight cognitive model for smart detection systems based on speech emotions.•Utilized dilated convolutional layers and introduced a two-stream self-attention module for classification problems.•Utilized a two-branch attention mechanism that recognizes global cues using an MLP and spatial cues using a special dilated CNN.•Demonstrated that the suggested SER framework advances recent deep learning approaches for SER.•Experimented on the IEMOCAP, EMO-DB, and RAVDESS datasets from different perspectives, obtaining 78.01%, 93.00%, and 80.00% accuracy, respectively. |
---|---|
ISSN: | 1568-4946 1872-9681 |
DOI: | 10.1016/j.asoc.2021.107101 |
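The abstract describes a two-stream module: channel attention from globally pooled features passed through an MLP, spatial attention from a dilated convolution, and the two maps merged into combined attention weights applied to the input tensor. The paper itself is not reproduced here, so the following NumPy sketch is only an illustration of that general idea under stated assumptions: the MLP is a two-layer ReLU bottleneck, channels are collapsed by mean before the spatial branch, a single 3×3 dilated kernel stands in for the "special dilated CNN", and the two maps are combined by elementwise multiplication with sigmoid gating. None of these choices are confirmed by the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). Global average pool -> 2-layer MLP -> per-channel gates in (0, 1)."""
    pooled = x.mean(axis=(1, 2))           # (C,) global cues
    hidden = np.maximum(w1 @ pooled, 0.0)  # ReLU bottleneck, (C // r,)
    return sigmoid(w2 @ hidden)            # (C,)

def dilated_conv2d(x, kernel, dilation):
    """Single-channel 'same' dilated convolution. x: (H, W), kernel: (k, k), stride 1."""
    k = kernel.shape[0]
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(k):          # accumulate shifted, weighted copies of the input
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * xp[di:di + H, dj:dj + W]
    return out

def spatial_attention(x, kernel, dilation=2):
    """Collapse channels by mean, then one dilated conv -> sigmoid spatial map in (0, 1)."""
    m = x.mean(axis=0)                                   # (H, W)
    return sigmoid(dilated_conv2d(m, kernel, dilation))  # (H, W)

def self_attention_module(x, w1, w2, kernel, dilation=2):
    """Merge channel and spatial attention and reweight the feature map x: (C, H, W)."""
    ca = channel_attention(x, w1, w2)            # (C,)
    sa = spatial_attention(x, kernel, dilation)  # (H, W)
    # Broadcast both gate sets over the tensor; output keeps the input shape.
    return x * ca[:, None, None] * sa[None, :, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W, r = 8, 6, 6, 2        # hypothetical sizes; r is the MLP reduction ratio
    x = rng.standard_normal((C, H, W))
    w1 = 0.1 * rng.standard_normal((C // r, C))
    w2 = 0.1 * rng.standard_normal((C, C // r))
    kernel = 0.1 * rng.standard_normal((3, 3))
    y = self_attention_module(x, w1, w2, kernel)
    print(y.shape)  # (8, 6, 6) — same shape as the input, so it can sit between conv and FC layers
```

Because both gate sets lie in (0, 1), the module can only rescale activations, which matches the abstract's claim of "insignificant overheads": it adds no change to tensor shape and only a small number of parameters.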