Multiple attention convolutional-recurrent neural networks for speech emotion recognition

Bibliographic Details
Published in 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW) pp. 1 - 8
Main Authors Zhang, Zhihao, Wang, Kunxia
Format Conference Proceeding
Language English
Published IEEE 18.10.2022
Subjects
Online Access Get full text
DOI 10.1109/ACIIW57231.2022.10086021

Abstract Speech Emotion Recognition (SER) is of great significance in the research fields of human-computer interaction and affective computing. One of the major challenges for SER lies in how to extract effective emotional features from lengthy utterances. However, since most existing deep-learning-based SER models adopt only Log-Mel spectrograms as input, they cannot fully convey the emotional information in speech. Furthermore, the limited extraction ability of such models may make it difficult to capture key emotional representations. To address these issues, we propose a new multiple-attention convolutional recurrent network comprising convolutional neural network (CNN) and bidirectional long short-term memory (BiLSTM) modules, which use extracted Mel-spectrum and Fourier coefficient features respectively, helping to complement the emotional information. The multiple attention mechanisms in our model are as follows: spatial and channel attention mechanisms are added to the CNN module to focus on key emotional areas and locate more effective features, while temporal attention assigns weights to the time-series segment features after the BiLSTM extracts sequence information. Experimental results show that the model achieves weighted accuracies (WA) of 87.9%, 76.5%, and 75.2% and unweighted accuracies (UA) of 87.6%, 73.5%, and 70.1% on the EMODB, IEMOCAP, and EESDB speech datasets respectively, outperforming most state-of-the-art methods.
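The temporal attention step described in the abstract (weighting each BiLSTM time-step feature before producing an utterance-level summary) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names, the single learned projection vector `w`, and the tanh scoring are assumptions standing in for whatever scoring network the paper actually uses.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def temporal_attention_pool(seq, w, b=0.0):
    """Pool a (T, D) sequence of frame features into one (D,) vector.

    seq: per-time-step features, e.g. BiLSTM outputs (T frames, D dims)
    w, b: illustrative learned scoring parameters (hypothetical names)
    Returns the attention-weighted summary and the weights themselves.
    """
    scores = np.tanh(seq @ w + b)   # (T,) unnormalized relevance per frame
    alpha = softmax(scores)         # (T,) attention weights, non-negative, sum to 1
    return alpha @ seq, alpha       # weighted sum over time -> (D,)

# Toy usage: 50 frames of 128-dim sequence features.
rng = np.random.default_rng(0)
T, D = 50, 128
seq = rng.standard_normal((T, D))
w = rng.standard_normal(D) * 0.1
summary, alpha = temporal_attention_pool(seq, w)
```

The summary vector would then feed a classifier head; frames the scoring function rates as more emotionally salient contribute proportionally more to it.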
Author Wang, Kunxia
Zhang, Zhihao
Author_xml – sequence: 1
  givenname: Zhihao
  surname: Zhang
  fullname: Zhang, Zhihao
  email: 1511827481@qq.com
  organization: School of Electronic and Information Engineering, Anhui Jianzhu University, Anhui International Joint Research Center for Ancient Architecture Intellisencing and Multi-Dimensional Modeling, Hefei, China
– sequence: 2
  givenname: Kunxia
  surname: Wang
  fullname: Wang, Kunxia
  email: kxwang@ahjzu.edu.cn
  organization: Higher Education Institutes, School of Electronic and Information Engineering, Anhui Jianzhu University, Key Laboratory of Architectural Acoustic Environment of Anhui, Hefei, China
BookMark eNo1j8tKxDAYhSPowhl9Axd5gdY_16bLoXgZGHGjiKsh6fzRYicpaar49nZGXX0HzgXOgpyGGJAQyqBkDOrrVbNev6iKC1Zy4LxkAEYDZydkwbRWUskaxDl5fZj63A09UpszhtzFQNsYPmM_HbTti4TtlNJs0YBTsv2M_BXTx0h9THQcENt3ivt4rM7h-Ba6g74gZ972I17-cUmeb2-emvti83i3blabouMgc2F1xU3lvWTeOeNapiwDKbTz0hlVG9gJh1w5ZAg1h1poacEw5GaH8xEUS3L1u9sh4nZI3d6m7-3_XfEDGh5Sbw
ContentType Conference Proceeding
DBID 6IE
6IL
CBEJK
RIE
RIL
DOI 10.1109/ACIIW57231.2022.10086021
DatabaseName IEEE Electronic Library (IEL) Conference Proceedings
IEEE Proceedings Order Plan All Online (POP All Online) 1998-present by volume
IEEE Xplore All Conference Proceedings
IEEE Xplore Digital Library
IEEE Proceedings Order Plans (POP All) 1998-Present
DatabaseTitleList
Database_xml – sequence: 1
  dbid: RIE
  name: IEEE Xplore
  url: https://proxy.k.utb.cz/login?url=https://ieeexplore.ieee.org/
  sourceTypes: Publisher
DeliveryMethod fulltext_linktorsrc
EISBN 1665454903
9781665454902
EndPage 8
ExternalDocumentID 10086021
Genre orig-research
GroupedDBID 6IE
6IL
CBEJK
RIE
RIL
ID FETCH-LOGICAL-i204t-a67287ff41fbb8bc15a10436bf4b85980d3be25be1e09209364a081e28de166e3
IEDL.DBID RIE
IngestDate Thu Jan 18 11:14:29 EST 2024
IsPeerReviewed false
IsScholarly false
Language English
LinkModel DirectLink
MergedId FETCHMERGED-LOGICAL-i204t-a67287ff41fbb8bc15a10436bf4b85980d3be25be1e09209364a081e28de166e3
PageCount 8
ParticipantIDs ieee_primary_10086021
PublicationCentury 2000
PublicationDate 2022-Oct.-18
PublicationDateYYYYMMDD 2022-10-18
PublicationDate_xml – month: 10
  year: 2022
  text: 2022-Oct.-18
  day: 18
PublicationDecade 2020
PublicationTitle 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW)
PublicationTitleAbbrev ACIIW
PublicationYear 2022
Publisher IEEE
Publisher_xml – name: IEEE
Score 1.8201442
Snippet Speech Emotion Recognition is of great significance in the research field of human-computer interaction and affective computing. One of the major challenges...
SourceID ieee
SourceType Publisher
StartPage 1
SubjectTerms Affective computing
Convolutional neural networks
Emotion recognition
Feature extraction
Human-computer interaction
Multiple attention mechanisms
Recurrent neural networks
Speech emotion recognition
Speech recognition
Time series analysis
Title Multiple attention convolutional-recurrent neural networks for speech emotion recognition
URI https://ieeexplore.ieee.org/document/10086021
hasFullText 1
inHoldings 1
isFullTextHit
isPrint
link http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjZ3PS8MwFMeD7uRJxYm_ycFruiRNsvQow7EJGx4cztNo0lcURzdcd_GvN68_JgqCp5ZSaMkrvPfS7_fzCLnNQpGeqdwwzzPPlPacWeNyZmXuE2u1TR36nSdTM5qph7meN2b1ygsDAJX4DCI8rf7lZyu_xa2yHoJoDEfb-H7o3GqzVqvO4UnvbjAeP-t-qFhC3ydl1N7-Y3BKlTeGh2TaPrGWi7xH29JF_vMXjPHfr3REut8WPfq4Sz7HZA-KE_IyaeSBFKmZlY6Roqy8-bzSJfvA7XUEMlEEWabLcKhk4Bsaile6WQP4Vwr1aB-6Exetii6ZDe-fBiPWzE5gb5KrkqWmH3qhPFcid846L3QqkDbvcuWsTizPYgdSOxDAE8mT2Kg0VAcgbQbCGIhPSadYFXBGKPI7ZeiaFNK4hMlSHvuYA8i-DqGMxTnp4ros1jUeY9EuycUf1y_JAYYHE4CwV6RTfmzhOmT20t1UEf0CzIOluA
linkProvider IEEE
linkToHtml http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwjV1NSwMxEA2iBz2pWPHbHLzuNskmafYoxdJqWzy0WE9lk51FsbSl3V789Wb2o6IgeMoSCLtkAm8m-94bQu5Sn6SnMtOBY6kLpHIsMNpmgRGZi41RJrGodx4MdXcsHydqUonVCy0MABTkMwjxsfiXny7cBq_KmmhEoxnKxvc88CteyrVqfg6Lm_ftXu9FtXzO4is_IcJ6wY_WKQVydA7JsH5nSRj5CDe5Dd3nLzvGf3_UEWl8i_To8xZ-jskOzE_I66AiCFL0zSyYjBSJ5dUBS2bBCi_Y0ZKJopVlMvNDQQRfU5--0vUSwL1RKJv70C29aDFvkHHnYdTuBlX3hOBdMJkHiW75aijLJM-sNdZxlXD0m7eZtEbFhqWRBaEscGCxYHGkZeLzAxAmBa41RKdkd76Ywxmh6OApfN0k0Y-L6zRhkYsYgGgpH8yIn5MG7st0WRpkTOstufhj_pbsd0eD_rTfGz5dkgMMFcIBN1dkN19t4NrjfG5viuh-AaD4qQE
openUrl ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Abook&rft.genre=proceeding&rft.title=2022+10th+International+Conference+on+Affective+Computing+and+Intelligent+Interaction+Workshops+and+Demos+%28ACIIW%29&rft.atitle=Multiple+attention+convolutional-recurrent+neural+networks+for+speech+emotion+recognition&rft.au=Zhang%2C+Zhihao&rft.au=Wang%2C+Kunxia&rft.date=2022-10-18&rft.pub=IEEE&rft.spage=1&rft.epage=8&rft_id=info:doi/10.1109%2FACIIW57231.2022.10086021&rft.externalDocID=10086021