BLSTM and CNN Stacking Architecture for Speech Emotion Recognition

Bibliographic Details
Published in	Neural processing letters Vol. 53; no. 6; pp. 4097 - 4115
Main Authors	Li, Dongdong; Sun, Linyu; Xu, Xinlei; Wang, Zhe; Zhang, Jing; Du, Wenli
Format	Journal Article
Language	English
Published	New York: Springer US, 01.12.2021; Springer Nature B.V
Subjects	Algorithms; Artificial Intelligence; Artificial neural networks; Classification; Complex Systems; Computational Intelligence; Computer Science; Deep learning; Emotion recognition; Emotions; Machine learning; Neural networks; Speech; Speech recognition; Statistical analysis; Support vector machines; Time series; Stacking; Convolutional neural network; Bidirectional long short term memory; Speech emotion recognition
ISSN	1370-4621 1573-773X
DOI	10.1007/s11063-021-10581-z

More Information
Summary:	Speech Emotion Recognition (SER) is the challenging task of distinguishing and interpreting the sentiments carried in speech. Deep learning has proved effective at handling acoustic features. For instance, Bidirectional Long Short-Term Memory (BLSTM) is well suited to time-series acoustic features, and a Convolutional Neural Network (CNN) can discover the local structure among different features. This paper proposes the BLSTM and CNN Stacking Architecture (BCSA) to improve emotion recognition. The feature matrices are sliced to match the input formats of the BLSTM and the CNN. To exploit the different roles of the two models, Stacking is used to integrate them. Specifically, to guard against overfitting, the probability estimates from the BLSTM and the CNN are combined into new data using K-fold cross-validation. Finally, a logistic regression fitted to this new data on top of the stacked models recognizes the emotions. The experimental results show that the proposed architecture outperforms either single model. Furthermore, compared with the state-of-the-art SER models known to the authors, BCSA may be more suitable for SER because it integrates time-series acoustic features with the local structure among different features.
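
The stacking step described in the summary can be sketched as below. This is a minimal illustration under stated assumptions, not the authors' implementation: the two base learners are simple centroid classifiers standing in for the BLSTM and the CNN, the data is synthetic, and every name (`CentroidModel`, `out_of_fold_probs`, `fit_logreg`) is hypothetical. It shows only the mechanism the summary names: each sample's class probabilities come from base models trained on the other K-1 folds, and a logistic regression is then fitted to those concatenated probabilities.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

class CentroidModel:
    """Stand-in base learner (NOT a BLSTM or CNN): softmax over negative
    squared distances to per-class centroids on a subset of feature columns."""
    def __init__(self, cols):
        self.cols = cols

    def _view(self, x):
        return [x[j] for j in self.cols]

    def fit(self, X, y, n_classes):
        self.centroids = []
        for c in range(n_classes):
            rows = [self._view(x) for x, t in zip(X, y) if t == c]
            self.centroids.append([sum(col) / len(rows) for col in zip(*rows)])
        return self

    def predict_proba(self, X):
        return [softmax([-sum((a - b) ** 2 for a, b in zip(self._view(x), c))
                         for c in self.centroids]) for x in X]

def out_of_fold_probs(make_models, X, y, n_classes, k=5, seed=0):
    """Return one row of concatenated base-model probabilities per sample,
    each predicted by models that never saw that sample during training."""
    idx = list(range(len(X)))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    meta = [None] * len(X)
    for held_out in folds:
        held = set(held_out)
        train = [i for i in idx if i not in held]
        Xtr, ytr = [X[i] for i in train], [y[i] for i in train]
        models = [m.fit(Xtr, ytr, n_classes) for m in make_models()]
        probs = [m.predict_proba([X[i] for i in held_out]) for m in models]
        for row, i in enumerate(held_out):
            meta[i] = [p for model_probs in probs for p in model_probs[row]]
    return meta

def fit_logreg(Z, y, n_classes, lr=0.5, epochs=200):
    """Multinomial logistic regression (meta-learner) by batch gradient descent."""
    d = len(Z[0]) + 1  # +1 for the bias term
    W = [[0.0] * d for _ in range(n_classes)]
    for _ in range(epochs):
        grad = [[0.0] * d for _ in range(n_classes)]
        for z, t in zip(Z, y):
            zb = z + [1.0]
            p = softmax([sum(w * v for w, v in zip(Wc, zb)) for Wc in W])
            for c in range(n_classes):
                err = p[c] - (1.0 if c == t else 0.0)
                for j in range(d):
                    grad[c][j] += err * zb[j]
        for c in range(n_classes):
            for j in range(d):
                W[c][j] -= lr * grad[c][j] / len(Z)
    return W

def predict(W, z):
    zb = z + [1.0]
    scores = [sum(w * v for w, v in zip(Wc, zb)) for Wc in W]
    return scores.index(max(scores))

# Synthetic stand-in data: 3 classes, 20 samples each, 4 features.
rng = random.Random(1)
n_classes = 3
X, y = [], []
for c in range(n_classes):
    for _ in range(20):
        X.append([c * 3.0 + rng.gauss(0, 1) for _ in range(4)])
        y.append(c)

# Two base learners that look at different feature columns.
make_models = lambda: [CentroidModel([0, 1]), CentroidModel([2, 3])]
meta_X = out_of_fold_probs(make_models, X, y, n_classes)
W = fit_logreg(meta_X, y, n_classes)
# Sanity check only: accuracy on the same meta-data the meta-learner was fitted on.
accuracy = sum(predict(W, z) == t for z, t in zip(meta_X, y)) / len(y)
print(f"meta-learner training accuracy: {accuracy:.2f}")
```

Building the meta-learner's inputs from held-out predictions is what addresses the overfitting concern mentioned in the summary: if the base models scored the samples they were trained on, their probabilities would be overconfident and the logistic regression would learn to trust them too much.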