Combining cross-modal knowledge transfer and semi-supervised learning for speech emotion recognition

Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 229, p. 107340
Main Authors: Zhang, Sheng; Chen, Min; Chen, Jincai; Li, Yuan-Fang; Wu, Yiling; Li, Minglei; Zhu, Chuanbo
Format: Journal Article
Language: English
Published: Amsterdam: Elsevier B.V. (Elsevier Science Ltd), 11.10.2021

Summary: Speech emotion recognition is an important task with a wide range of applications. However, progress in speech emotion recognition is limited by the lack of large, high-quality labeled speech datasets, owing to the high annotation cost and the inherent ambiguity of emotion labels. The recent emergence of large-scale video data makes it possible to obtain massive amounts of speech data, albeit unlabeled. To exploit such unlabeled data, previous works have explored semi-supervised learning methods on various tasks; however, noisy pseudo-labels remain a challenge for these methods. In this work, to alleviate this issue, we propose a new architecture that incorporates cross-modal knowledge transfer from the visual to the audio modality into our semi-supervised learning method with consistency regularization. We posit that introducing visual emotional knowledge through cross-modal transfer can increase the diversity and accuracy of pseudo-labels and improve the robustness of the model. To combine the knowledge from cross-modal transfer and semi-supervised learning, we design two fusion algorithms, namely weighted fusion and consistent & random. Our experiments on the CH-SIMS and IEMOCAP datasets show that our method can effectively use additional unlabeled audio-visual data to outperform state-of-the-art results.
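
A minimal sketch of the weighted-fusion idea, for intuition only: the function names, the fusion weight alpha, and the confidence threshold below are illustrative assumptions and are not taken from the paper. The snippet shows how class-probability predictions from a visual teacher and from the audio model might be merged on unlabeled clips, with a threshold deciding which fused predictions are kept as pseudo-labels for the semi-supervised (consistency-regularized) training step.

# Illustrative sketch (not the authors' code): weighted fusion of
# visual-teacher and audio-model predictions into pseudo-labels.
import numpy as np

def weighted_fusion(p_visual, p_audio, alpha=0.5):
    # p_visual, p_audio: (batch, num_classes) probability rows summing to 1.
    # alpha: assumed fusion weight given to the visual modality.
    fused = alpha * p_visual + (1.0 - alpha) * p_audio
    return fused / fused.sum(axis=1, keepdims=True)  # renormalize rows

def select_pseudo_labels(fused, threshold=0.7):
    # Keep only confident fused predictions as hard pseudo-labels.
    confidence = fused.max(axis=1)
    mask = confidence >= threshold
    return fused.argmax(axis=1)[mask], mask

# Toy batch: 3 unlabeled clips, 4 emotion classes.
p_visual = np.array([[0.70, 0.10, 0.10, 0.10],
                     [0.30, 0.30, 0.20, 0.20],
                     [0.10, 0.80, 0.05, 0.05]])
p_audio = np.array([[0.60, 0.20, 0.10, 0.10],
                    [0.25, 0.25, 0.25, 0.25],
                    [0.20, 0.70, 0.05, 0.05]])

fused = weighted_fusion(p_visual, p_audio, alpha=0.5)
labels, mask = select_pseudo_labels(fused, threshold=0.6)
print(labels, mask)  # only confident clips would enter the unlabeled loss

In this toy batch only the first and third clips pass the confidence threshold and would contribute to the unlabeled-data loss; the uncertain second clip is discarded, which is one common way noisy pseudo-labels are filtered in such pipelines.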
ISSN: 0950-7051, 1872-7409
DOI: 10.1016/j.knosys.2021.107340