Mongolian emotional speech synthesis based on transfer learning and emotional embedding

Bibliographic Details
Published in	2021 International Conference on Asian Language Processing (IALP), pp. 78-83
Main Authors	Huang, Aihong; Bao, Feilong; Gao, Guanglai; Shan, Yu; Liu, Rui
Format	Conference Proceeding
Language	English
Published	IEEE 11.12.2021
Subjects	Controllability; Emotional embedding; Emotional speech synthesis; End-to-End; Speech enhancement; Speech synthesis; Traditional Mongolian; Training; Transfer learning
Summary:	In recent years, attention-based end-to-end speech synthesis has outperformed traditional speech synthesis models, and end-to-end Mongolian speech synthesis has reached the standard required for practical application. However, because training corpora are sparse, research on Mongolian emotional speech synthesis remains far from mature. To address this problem, we built a Mongolian emotional corpus and constructed the first emotionally controllable Mongolian speech synthesis system. By combining transfer learning and emotional embedding, we obtained a Mongolian emotional speech synthesis system covering 8 emotions (happy, angry, sadness, surprise, fear, disgust, boredom and neutral). In the proposed method, emotion labels are fed to an emotional embedding layer to generate emotion vectors, which are concatenated with the output vectors of the bidirectional LSTM layer so that the text representation vectors carry information about the emotion category, making it possible to synthesize speech in a variety of emotions. Experiments show that our method can synthesize high-quality Mongolian emotional speech.
DOI:	10.1109/IALP54817.2021.9675192
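
The conditioning step described in the summary, where an emotion label is turned into a vector and joined to the bidirectional LSTM outputs, can be illustrated with a minimal sketch. This is not the authors' implementation: the vector sizes, the random initialisation, the helper name `splice_emotion`, and the choice to append the same emotion vector at every time step are all assumptions made for illustration, and the encoder outputs are random stand-ins rather than the output of a trained network.

```python
import random

# Emotion categories as listed in the summary.
EMOTIONS = ["happy", "angry", "sadness", "surprise",
            "fear", "disgust", "boredom", "neutral"]

EMOTION_DIM = 4   # assumed embedding size (not stated in the record)
ENCODER_DIM = 6   # assumed size of each BiLSTM output vector

rng = random.Random(0)

# Embedding table: one vector per emotion label.
# Randomly initialised here; in a real system these would be learned.
emotion_table = {
    label: [rng.uniform(-1.0, 1.0) for _ in range(EMOTION_DIM)]
    for label in EMOTIONS
}


def splice_emotion(encoder_outputs, emotion):
    """Append the emotion vector to the encoder vector of every time step.

    Hypothetical helper: broadcasting one vector across all steps is an
    assumption about how the concatenation is done.
    """
    emotion_vec = emotion_table[emotion]
    return [step + emotion_vec for step in encoder_outputs]


# Stand-in for the BiLSTM outputs of a 5-symbol input text.
encoder_outputs = [
    [rng.uniform(-1.0, 1.0) for _ in range(ENCODER_DIM)]
    for _ in range(5)
]

conditioned = splice_emotion(encoder_outputs, "happy")
print(len(conditioned), len(conditioned[0]))  # 5 10
```

Each conditioned vector keeps the original text representation in its first `ENCODER_DIM` entries and carries the emotion category in the remaining `EMOTION_DIM` entries, so a downstream decoder sees both at every step.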