Exploring Zero-Shot Emotion Recognition in Speech Using Semantic-Embedding Prototypes

Bibliographic Details
Published in	IEEE Transactions on Multimedia, Vol. 24, pp. 2752-2765
Main Authors	Xu, Xinzhou; Deng, Jun; Cummins, Nicholas; Zhang, Zixing; Zhao, Li; Schuller, Björn W.
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2022
Subjects	Annotations; Electronic mail; Embedding; Emotion recognition; Emotional factors; Emotions; Machine learning; Optimization; Paralinguistics; Predictive models; Prototypes; Semantic-embedding prototypes; Semantics; Speech; Speech emotion recognition; Speech recognition; Training; Zero-shot learning
Summary:	Speech Emotion Recognition (SER) makes it possible for machines to perceive affective information. Our previous research differed from conventional SER endeavours in that it focused on recognising unseen emotions in speech autonomously through machine learning. Such a step would enable the automatic learning of unknown, emerging emotional states. This type of learning framework, however, still relied on manual annotations to obtain multiple samples of each emotion. To reduce this additional workload, we propose herein a zero-shot SER framework that employs a per-emotion semantic-embedding paradigm to describe emotions, instead of using sample-wise descriptors. Aiming to optimise the relationship between emotions, prototypes, and speech samples, this framework includes two types of learning strategies: sample-wise learning and emotion-wise learning. These strategies apply a novel learning process to speech samples and emotions, respectively, via specifically designed semantic-embedding prototypes. We verify the utility of these approaches through an extensive experimental evaluation on two corpora, covering three aspects: the influence of different types of learning strategies, emotional-pair comparison, and the selection of semantic-embedding prototypes and paralinguistic features. The experimental results indicate that semantic-embedding prototypes are applicable to zero-shot emotion recognition in speech, although the outcome is influenced by the choice of strategies and prototypes.
ISSN:	1520-9210, 1941-0077
DOI:	10.1109/TMM.2021.3087098
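
Illustrative note:	The summary describes recognising unseen emotions by relating speech samples to per-emotion semantic-embedding prototypes. The sketch below shows one generic way such zero-shot matching is often done: project an acoustic feature vector into the semantic space and pick the unseen emotion whose prototype is most similar. It is not the authors' method. The projection matrix, the prototype vectors, the emotion labels, and all function names are hypothetical values chosen for illustration; in a real system the projection would be learned from seen emotions, and the prototypes would come from a semantic-embedding model.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def project(features, weights):
    """Map an acoustic feature vector into the semantic space (matrix-vector product)."""
    return [sum(w * x for w, x in zip(row, features)) for row in weights]


def predict_unseen(features, weights, prototypes):
    """Return the emotion whose prototype is most similar to the projected sample."""
    z = project(features, weights)
    return max(prototypes, key=lambda label: cosine(z, prototypes[label]))


# Hypothetical toy values, for illustration only (assumptions, not from the paper).
WEIGHTS = [[1.0, 0.0, 0.5],    # 3-dimensional acoustic features
           [0.0, 1.0, -0.5]]   # mapped to a 2-dimensional semantic space
PROTOTYPES = {"surprise": [1.0, 0.2], "boredom": [-0.3, 1.0]}

label = predict_unseen([0.9, 0.1, 0.4], WEIGHTS, PROTOTYPES)
print(label)
```

Because each emotion is represented by a single prototype rather than by annotated speech samples, adding a new unseen emotion in this sketch only requires adding one entry to `PROTOTYPES`.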