BLSTM and CNN Stacking Architecture for Speech Emotion Recognition

Speech Emotion Recognition (SER) is a huge challenge for distinguishing and interpreting the sentiments carried in speech. Fortunately, deep learning is proved to have great ability to deal with acoustic features. For instance, Bidirectional Long Short Term Memory (BLSTM) has an advantage of solving...

Full description

Saved in:
Bibliographic Details
Published inNeural processing letters Vol. 53; no. 6; pp. 4097 - 4115
Main Authors Li, Dongdong, Sun, Linyu, Xu, Xinlei, Wang, Zhe, Zhang, Jing, Du, Wenli
Format Journal Article
LanguageEnglish
Published New York Springer US 01.12.2021
Springer Nature B.V
Subjects
Online AccessGet full text
ISSN1370-4621
1573-773X
DOI10.1007/s11063-021-10581-z

Cover

More Information
Summary:Speech Emotion Recognition (SER) is a huge challenge for distinguishing and interpreting the sentiments carried in speech. Fortunately, deep learning is proved to have great ability to deal with acoustic features. For instance, Bidirectional Long Short Term Memory (BLSTM) has an advantage of solving time series acoustic features and Convolutional Neural Network (CNN) can discover the local structure among different features. This paper proposed the BLSTM and CNN Stacking Architecture (BCSA) to enhance the ability to recognition emotions. In order to match the input formats of BLSTM and CNN, slicing feature matrices is necessary. For utilizing the different roles of the BLSTM and CNN, the Stacking is employed to integrate the BLSTM and CNN. In detail, taking into account overfitting problem, the estimates of probabilistic quantities from BLSTM and CNN are combined as new data using K-fold cross validation. Finally, based on the Stacking models, the logistic regression is used to recognize emotions effectively by fitting the new data. The experiment results demonstrate that the performance of proposed architecture is better than that of single model. Furthermore, compared with the state-of-the-art model on SER in our knowledge, the proposed method BCSA may be more suitable for SER by integrating time series acoustic features and the local structure among different features.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 14
ISSN:1370-4621
1573-773X
DOI:10.1007/s11063-021-10581-z