Mongolian emotional speech synthesis based on transfer learning and emotional embedding

Bibliographic Details
Published in 2021 International Conference on Asian Language Processing (IALP) pp. 78-83
Main Authors Huang, Aihong; Bao, Feilong; Gao, Guanglai; Shan, Yu; Liu, Rui
Format Conference Proceeding
Language English
Published IEEE 11.12.2021
Summary: In recent years, attention-based end-to-end speech synthesis has outperformed traditional speech synthesis models, and end-to-end Mongolian speech synthesis has reached the standard required for practical application. However, because the training corpus is sparse, research on Mongolian emotional speech synthesis remains underdeveloped. To address this, we established a Mongolian emotional corpus and constructed, for the first time, an emotionally controllable Mongolian speech synthesis system. By combining transfer learning and emotional embedding, we achieved a Mongolian emotional speech synthesis system covering 8 emotions (happy, angry, sadness, surprise, fear, disgust, boredom and neutral). In the proposed method, emotional labels are fed to an emotional embedding layer to generate emotional vectors, which are spliced with the output vectors of the bidirectional LSTM layer so that the text representation vectors carry information about the emotional category, allowing a variety of emotional voices to be synthesized. Experiments show that our method can synthesize high-quality Mongolian emotional speech.
DOI:10.1109/IALP54817.2021.9675192
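
The splicing step described in the summary (an emotion label is mapped to an emotion vector, which is joined to the bidirectional LSTM output vectors) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the random arrays stand in for a trained encoder and a learned embedding layer, the dimensions and the index order of the emotions are hypothetical, and appending the same emotion vector at every time step is an assumption about how the splice is done.

```python
import numpy as np

# The 8 emotion categories named in the summary; the index order is an assumption.
EMOTIONS = ["happy", "angry", "sadness", "surprise",
            "fear", "disgust", "boredom", "neutral"]


def concat_emotion(encoder_out, emotion, embedding_table):
    """Append the emotion vector to every time step of the encoder output.

    encoder_out:     (T, D_enc) array standing in for the bidirectional LSTM output.
    embedding_table: (num_emotions, D_emo) array standing in for a learned
                     emotional embedding layer.
    Returns a (T, D_enc + D_emo) array.
    """
    emotion_id = EMOTIONS.index(emotion)                  # label -> integer id
    emotion_vec = embedding_table[emotion_id]             # embedding lookup, shape (D_emo,)
    tiled = np.tile(emotion_vec, (encoder_out.shape[0], 1))  # repeat per time step
    return np.concatenate([encoder_out, tiled], axis=-1)  # splice along the feature axis


# Hypothetical sizes: 5 time steps, 16 encoder features, 4 emotion features.
rng = np.random.default_rng(0)
T, D_ENC, D_EMO = 5, 16, 4
encoder_out = rng.standard_normal((T, D_ENC))
embedding_table = rng.standard_normal((len(EMOTIONS), D_EMO))

conditioned = concat_emotion(encoder_out, "sadness", embedding_table)
print(conditioned.shape)  # (5, 20)
```

In this sketch, changing only the emotion argument changes the last D_EMO features of every time step while leaving the text features untouched, which is what would let a downstream decoder be steered by emotion category.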