Att-Net: Enhanced emotion recognition system using lightweight self-attention module
Published in | Applied soft computing Vol. 102; p. 107101 |
---|---|
Main Authors | , |
Format | Journal Article |
Language | English |
Published | Elsevier B.V, 01.04.2021 |
Subjects | |
Summary: | Speech emotion recognition (SER) is an active research field in digital signal processing and plays a crucial role in numerous human–computer interaction (HCI) applications. Current baseline state-of-the-art systems suffer from relatively low accuracy and high computational cost, which must be improved to make them suitable for real-time industrial uses such as detecting content from speech data. The main reasons for the low recognition rate and high computational cost are the scarcity of datasets, model configuration, and pattern recognition, which remain the most challenging aspects of building a robust SER system. In this study, we address these problems and propose a simple and lightweight deep-learning-based self-attention module (SAM) for an SER system. The intermediate feature map is fed to SAM, which efficiently produces channel-axis and spatial-axis attention maps with negligible overhead. We use a multi-layer perceptron (MLP) in the channel attention branch to extract global cues and a special dilated convolutional neural network (CNN) in the spatial attention branch to extract spatial information from the input tensor. We then merge the spatial and channel attention maps to produce combined attention weights as a self-attention module. We place SAM between the convolutional and fully connected layers and train the model end-to-end. An ablation study and comprehensive experiments are conducted on the IEMOCAP, RAVDESS, and EMO-DB speech emotion datasets. The proposed SER system shows consistent improvements across all experiments on all datasets, achieving 78.01%, 80.00%, and 93.00% average recall, respectively.
•Proposed a simple and lightweight cognitive model for smart detection systems based on speech emotions.•Utilized dilated convolutional layers and introduced a two-stream self-attention module for classification problems.•Utilized a two-branch attention mechanism that recognizes global cues using an MLP and spatial cues using a special dilated CNN.•Demonstrated that the suggested SER framework advances recent deep learning approaches for SER.•Experimented on the IEMOCAP, EMO-DB, and RAVDESS datasets from different perspectives, obtaining 78.01%, 93.00%, and 80.00% accuracy, respectively. |
---|---|
ISSN: | 1568-4946 1872-9681 |
DOI: | 10.1016/j.asoc.2021.107101 |
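The abstract describes a two-stream module: channel attention from globally pooled features passed through an MLP, spatial attention from a dilated convolution, and the two maps merged into combined attention weights applied to the input tensor. The paper itself is not reproduced here, so the following NumPy sketch is only an illustration of that general idea under stated assumptions: the MLP is a two-layer ReLU bottleneck, channels are collapsed by mean before the spatial branch, a single 3×3 dilated kernel stands in for the "special dilated CNN", and the two maps are combined by elementwise multiplication with sigmoid gating. None of these choices are confirmed by the source.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(x, w1, w2):
    """x: (C, H, W). Global average pool -> 2-layer MLP -> per-channel gates in (0, 1)."""
    pooled = x.mean(axis=(1, 2))           # (C,) global cues
    hidden = np.maximum(w1 @ pooled, 0.0)  # ReLU bottleneck, (C // r,)
    return sigmoid(w2 @ hidden)            # (C,)

def dilated_conv2d(x, kernel, dilation):
    """Single-channel 'same' dilated convolution. x: (H, W), kernel: (k, k), stride 1."""
    k = kernel.shape[0]
    pad = dilation * (k // 2)
    xp = np.pad(x, pad)
    H, W = x.shape
    out = np.zeros_like(x)
    for i in range(k):          # accumulate shifted, weighted copies of the input
        for j in range(k):
            di, dj = i * dilation, j * dilation
            out += kernel[i, j] * xp[di:di + H, dj:dj + W]
    return out

def spatial_attention(x, kernel, dilation=2):
    """Collapse channels by mean, then one dilated conv -> sigmoid spatial map in (0, 1)."""
    m = x.mean(axis=0)                                   # (H, W)
    return sigmoid(dilated_conv2d(m, kernel, dilation))  # (H, W)

def self_attention_module(x, w1, w2, kernel, dilation=2):
    """Merge channel and spatial attention and reweight the feature map x: (C, H, W)."""
    ca = channel_attention(x, w1, w2)            # (C,)
    sa = spatial_attention(x, kernel, dilation)  # (H, W)
    # Broadcast both gate sets over the tensor; output keeps the input shape.
    return x * ca[:, None, None] * sa[None, :, :]

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    C, H, W, r = 8, 6, 6, 2        # hypothetical sizes; r is the MLP reduction ratio
    x = rng.standard_normal((C, H, W))
    w1 = 0.1 * rng.standard_normal((C // r, C))
    w2 = 0.1 * rng.standard_normal((C, C // r))
    kernel = 0.1 * rng.standard_normal((3, 3))
    y = self_attention_module(x, w1, w2, kernel)
    print(y.shape)  # (8, 6, 6) — same shape as the input, so it can sit between conv and FC layers
```

Because both gate sets lie in (0, 1), the module can only rescale activations, which matches the abstract's claim of "insignificant overheads": it adds no change to tensor shape and only a small number of parameters.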