Recognizing emotion from singing and speaking using shared models


Bibliographic Details
Published in: International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 139-145
Main Authors: Zhang, Biqiao; Essl, Georg; Provost, Emily Mower
Format: Conference Proceeding; Journal Article
Language: English
Published: IEEE, 01.09.2015
Summary: Speech and song are two types of vocal communication that are closely related to each other. While significant progress has been made in both speech and music emotion recognition, few works have concentrated on building a shared emotion recognition model for both speech and song. In this paper, we propose three shared emotion recognition models for speech and song: a simple model, a single-task hierarchical model, and a multi-task hierarchical model. We study the commonalities and differences present in emotion expression across these two communication domains. We compare performance across different settings, investigate the relationship between evaluator agreement rate and classification accuracy, and analyze the classification performance of individual feature groups. Our results show that the multi-task model classifies emotion more accurately than single-task models when the same set of features is used. This suggests that although spoken and sung emotion recognition are different tasks, they are related and can be considered together. The results demonstrate that utterances with lower agreement rates and emotions with low activation benefit the most from multi-task learning. Visual features appear to be more similar across spoken and sung emotion expression than acoustic features.
ISSN: 2156-8111
DOI: 10.1109/ACII.2015.7344563
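
The summary above contrasts single-task models with a multi-task model that shares information between spoken and sung emotion recognition. The sketch below is not taken from the paper; it only illustrates the common shape of such a multi-task classifier: a shared trunk trained on both domains plus a separate output head per domain. The feature dimensionality, layer sizes, number of emotion classes, and the use of PyTorch are all assumptions made for the example.

```python
# Illustrative sketch only (not the authors' code): a shared representation
# with domain-specific heads for speech and song, the usual layout of a
# multi-task emotion classifier. All sizes below are assumed for the example.
import torch
import torch.nn as nn

class SharedEmotionModel(nn.Module):
    def __init__(self, num_features=384, hidden=64, num_emotions=4):
        super().__init__()
        # Shared layers learn a representation common to both domains.
        self.shared = nn.Sequential(
            nn.Linear(num_features, hidden),
            nn.ReLU(),
        )
        # Domain-specific heads capture what differs between speech and song.
        self.speech_head = nn.Linear(hidden, num_emotions)
        self.song_head = nn.Linear(hidden, num_emotions)

    def forward(self, x, domain):
        h = self.shared(x)
        return self.speech_head(h) if domain == "speech" else self.song_head(h)

# Usage: route each batch through the head matching its domain label; the
# gradients from both tasks jointly update the shared layers.
model = SharedEmotionModel()
speech_batch = torch.randn(8, 384)
logits = model(speech_batch, domain="speech")
```

The design choice this sketch reflects is the one argued for in the summary: because the two tasks are related, letting them share parameters can help, especially for harder cases such as low-agreement utterances and low-activation emotions.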