Exemplar-Based Emotive Speech Synthesis

Bibliographic Details
Published in: IEEE/ACM Transactions on Audio, Speech, and Language Processing, Vol. 29, pp. 874-886
Main Authors: Wu, Xixin; Cao, Yuewen; Lu, Hui; Liu, Songxiang; Kang, Shiyin; Wu, Zhiyong; Liu, Xunying; Meng, Helen
Format: Journal Article
Language: English
Published: Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2021

Summary: Expressive text-to-speech (E-TTS) synthesis is important for enhancing user experience in communication with machines through the speech modality. However, one of the challenges in E-TTS is the lack of a precise description of emotions. Categorical specifications may be insufficient for describing complex emotions, while dimensional specifications suffer from ambiguity in annotation. This work advocates a new approach: describing emotive speech acoustics using spoken exemplars. We investigate methods to extract emotion descriptions from an input exemplar of emotive speech. The extracted measures are combined to form two descriptors, based on a capsule network (CapNet) and a residual error network (RENet). The former is designed to capture spatial information in the input exemplary spectrogram; the latter captures contrastive information between emotive acoustic expressions. Two approaches are applied to convert the variable-length feature sequence into a fixed-size description vector: (1) dynamic routing, which groups similar capsules into the output description; and (2) a recurrent neural network, whose hidden states store temporal information for the description. The two descriptors are integrated into a state-of-the-art sequence-to-sequence architecture, yielding an end-to-end system optimized as a whole toward the goal of generating correct emotive speech. Experimental results on a public audiobook dataset demonstrate that the two exemplar-based approaches achieve significant performance improvements over the baseline system in both emotion similarity and speech quality.
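
To make the two sequence-pooling strategies named in the summary concrete, the following is a minimal illustrative sketch in PyTorch (the framework is an assumption; the paper's actual implementation is not given here). All tensor shapes, layer sizes, and function names (rnn_descriptor, routing_descriptor) are hypothetical.

    # Two ways to collapse a variable-length feature sequence into a
    # fixed-size emotion descriptor, per the summary above. Illustrative only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    def rnn_descriptor(features, gru):
        """RNN-style pooling: the final GRU hidden state summarizes the sequence."""
        _, h_n = gru(features)            # h_n: (1, batch, hidden)
        return h_n.squeeze(0)             # (batch, hidden) fixed-size descriptor

    def routing_descriptor(u_hat, iterations=3):
        """Capsule-style pooling: dynamic routing-by-agreement groups similar
        input capsules into output capsules.
        u_hat: (batch, n_in, n_out, dim) prediction vectors."""
        b = torch.zeros(u_hat.shape[:3], device=u_hat.device)   # routing logits
        for _ in range(iterations):
            c = F.softmax(b, dim=2)                              # coupling coefficients over outputs
            s = (c.unsqueeze(-1) * u_hat).sum(dim=1)             # weighted sum: (batch, n_out, dim)
            norm = s.norm(dim=-1, keepdim=True)
            v = (norm**2 / (1 + norm**2)) * s / (norm + 1e-8)    # squash nonlinearity
            b = b + (u_hat * v.unsqueeze(1)).sum(-1)             # agreement update
        return v.flatten(1)                                      # (batch, n_out*dim)

    # Usage with assumed sizes: 80-dim mel frames, variable length T = 120.
    feats = torch.randn(2, 120, 80)                  # (batch, T, mel)
    gru = nn.GRU(80, 128, batch_first=True)
    print(rnn_descriptor(feats, gru).shape)          # torch.Size([2, 128])

    u_hat = torch.randn(2, 120, 8, 16)               # 120 input caps -> 8 output caps
    print(routing_descriptor(u_hat).shape)           # torch.Size([2, 128])

The contrast between the two strategies: the GRU encodes temporal order into its final hidden state, whereas routing-by-agreement iteratively reweights input capsules toward the output capsules they agree with, preserving spatial grouping rather than sequence order.
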
ISSN: 2329-9290 (Print), 2329-9304 (Electronic)
DOI: 10.1109/TASLP.2021.3052688