A BiLSTM–Transformer and 2D CNN Architecture for Emotion Recognition from Speech

Bibliographic Details
Published in Electronics (Basel) Vol. 12; no. 19; p. 4034
Main Authors Kim, Sera; Lee, Seok-Pil
Format Journal Article
Language English
Published Basel: MDPI AG, 01.10.2023
Abstract The significance of emotion recognition technology continues to grow, and research in this field enables artificial intelligence to understand and react to human emotions accurately. This study aims to improve speech emotion recognition by using dimensionality reduction algorithms for visualization, which helps delineate emotion-specific audio features. For recognition, we propose a new model architecture that combines a bidirectional long short-term memory (BiLSTM)–Transformer with a 2D convolutional neural network (CNN). The BiLSTM–Transformer processes audio features to capture the sequence of speech patterns, while the 2D CNN handles Mel-spectrograms to capture the spatial details of audio. The model is validated with 10-fold cross-validation. Applied to Emo-DB and RAVDESS, two major speech emotion recognition databases, the proposed methodology achieved unweighted accuracies of 95.65% and 80.19%, respectively. These results indicate that the proposed transformer-based deep learning model, combined with appropriate feature selection, can enhance performance in speech emotion recognition.
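The abstract reports unweighted accuracy under 10-fold cross-validation but does not spell out either procedure. As a minimal sketch, assuming unweighted accuracy means the average of per-class recall (its usual meaning in speech emotion recognition) and assuming plain contiguous folds, the evaluation could look like the Python below. The function names `unweighted_accuracy` and `k_fold_indices` are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every emotion class counts equally.

    Assumption: this is the usual definition of "unweighted accuracy";
    the abstract itself does not define the metric.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[label] / total[label] for label in total]
    return sum(recalls) / len(recalls)


def k_fold_indices(n_samples, k=10):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds.

    Assumption: no shuffling or stratification; the abstract does not
    state how the folds were drawn.
    """
    # The first (n_samples % k) folds get one extra sample each.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size
```

On an imbalanced toy set with three "angry" clips classified correctly and one "sad" clip misclassified, plain accuracy is 0.75 while unweighted accuracy is 0.5, which is why the metric is often preferred for corpora with uneven class sizes. In practice the samples would likely be shuffled or stratified by emotion before splitting.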
Author Kim, Sera
Lee, Seok-Pil (ORCID: 0000-0003-2520-6681)
ContentType Journal Article
Copyright 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.3390/electronics12194034
Discipline Engineering
EISSN 2079-9292
ISSN 2079-9292
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
ORCID 0000-0003-2520-6681
Subject Terms Accuracy
Algorithms
Artificial intelligence
Artificial neural networks
Deep learning
Emotion recognition
Emotions
Machine learning
Neural networks
Spectrograms
Speech
Speech recognition
Transformers