Speech emotion recognition with unsupervised feature learning

Bibliographic Details
Published in	Frontiers of Information Technology & Electronic Engineering, Vol. 16, no. 5, pp. 358-366
Main Authors	Huang, Zheng-wei; Xue, Wen-tao; Mao, Qi-rong
Format	Journal Article
Language	English
Published	Hangzhou: Zhejiang University Press; Springer Nature B.V, 01.05.2015
Subjects	Algorithms; Cluster analysis; Clustering; Cognitive tasks; Communications Engineering; Computer Hardware; Computer Science; Computer Systems Organization and Communication Networks; Electrical Engineering; Electronic engineering; Electronics and Microelectronics; Emotion recognition; Emotions; Feature recognition; Information technology; Instrumentation; Learning; Machine learning; Networks; Nodes; Performance enhancement; Performance evaluation; Speech; Speech recognition; Vector quantization; Unsupervised feature learning; Neural network; Affect computing; Speech emotion recognition
Summary:	Emotion-based features are critical for achieving high performance in a speech emotion recognition (SER) system. In general, it is difficult to develop these features due to the ambiguity of the ground-truth. In this paper, we apply several unsupervised feature learning algorithms (including K-means clustering, the sparse auto-encoder, and sparse restricted Boltzmann machines), which have promise for learning task-related features by using unlabeled data, to speech emotion recognition. We then evaluate the performance of the proposed approach and present a detailed analysis of the effect of two important factors in the model setup, the content window size and the number of hidden layer nodes. Experimental results show that larger content windows and more hidden nodes contribute to higher performance. We also show that the two-layer network cannot explicitly improve performance compared to a single-layer network.
Bibliography:	33-1389/TP
ISSN:	2095-9184; 2095-9230
DOI:	10.1631/FITEE.1400323
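
Illustration:	The summary names K-means clustering as one of the unsupervised feature learners and identifies the content window size and the number of hidden nodes as the two factors studied. The sketch below shows one plausible way these pieces could fit together. It is not taken from the paper. The random input frames, the 13-dimensional frame size, the soft "triangle" encoding, the mean pooling, and all function names are assumptions made for illustration only.

```python
import numpy as np

def stack_windows(frames, window):
    """Concatenate `window` consecutive frames into one vector per position."""
    n = frames.shape[0]
    return np.stack([frames[i:i + window].ravel() for i in range(n - window + 1)])

def kmeans(x, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns the k centroids (the 'hidden nodes')."""
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(k):
            members = x[labels == j]
            if len(members):  # leave an empty cluster's centroid unchanged
                centroids[j] = members.mean(axis=0)
    return centroids

def encode(x, centroids):
    """Assumed soft encoding: max(0, mean distance - distance to each centroid)."""
    dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
    return np.maximum(0.0, dists.mean(axis=1, keepdims=True) - dists)

# Placeholder data: 200 unlabeled frames of 13 values each (assumed, not real speech).
rng = np.random.default_rng(0)
frames = rng.normal(size=(200, 13))

windows = stack_windows(frames, window=5)     # content window size = 5 frames
centroids = kmeans(windows, k=16)             # 16 hidden nodes
features = encode(windows, centroids)         # one learned feature vector per window
utterance_vector = features.mean(axis=0)      # assumed pooling over the utterance
```

In this sketch, `window` and `k` stand in for the content window size and the number of hidden nodes; the pooled `utterance_vector` would then be passed to a separate, supervised emotion classifier.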