HiFi-GAN based Text-to-Speech Synthesis in Serbian

Bibliographic Details
Published in 2022 30th European Signal Processing Conference (EUSIPCO), pp. 1178-1182
Main Authors Suzic, Sinisa; Pekar, Darko; Secujski, Milan; Nosek, Tijana; Delic, Vlado
Format Conference Proceeding
Language English
Published EUSIPCO 29.08.2022
Summary: In this paper we present a deep neural network based text-to-speech system for the Serbian language, which converts generated acoustic features into a speech signal using the HiFi-GAN vocoder. The HiFi-GAN model was fine-tuned starting from an existing multi-speaker model trained on an English speech corpus. To overcome the problem of inadequate training data, we introduce a data generation technique based on a guided acoustic neural network, which attempts to minimize the mismatch between the data used in HiFi-GAN training and the data used at inference. The outputs of the acoustic network are intended to represent a trade-off between the original feature trajectories and the trajectories generated by the standard text-to-speech system. The results of subjective evaluation through listening tests show that the proposed system produces speech whose quality significantly surpasses that of speech generated by the best existing speech synthesis system for Serbian, and that its MOS is very close to that of natural speech.
ISSN: 2076-1465