Foreign Accent Conversion using Concentrated Attention

Foreign accent conversion is an important and challenging problem due to significant differences in the manner of articulation and the speech prosody of different regional speakers. In this paper, we propose a new method for the problem of foreign accent conversion that uses Phonetic Posteriorgrams...

Full description

Saved in:
Bibliographic Details
Published in2022 IEEE International Conference on Knowledge Graph (ICKG) pp. 386 - 391
Main Authors Zang, Xuexue, Xie, Fei, Weng, Fuliang
Format Conference Proceeding
LanguageEnglish
Published IEEE 01.11.2022
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:Foreign accent conversion is an important and challenging problem due to significant differences in the manner of articulation and the speech prosody of different regional speakers. In this paper, we propose a new method for the problem of foreign accent conversion that uses Phonetic Posteriorgrams (PPGs) and Log-scale Fundamental frequency (Log-F0) to address the mismatches of phonetic and prosody. Furthermore, we propose using concentrated attention to improve the alignment of input sequences and mel-spectrograms. The concentrated attention selects the top k highest score values in the attention matrix row by row. In this way, the attention weight of the content related to the current sequence will be the largest. Our approach first trains a PPG extractor using LibriSpeech Corpus, which uses an end-to-end hybrid CTC-attention model. Then, the modified Tacotron2 based on concentrated attention is trained to model the relationships between PPGs and mel-spectrograms. In our proposed framework, the input of Tacotron2 is the concatenation of PPG embedding and normalized Log-scale fundamental frequency (Log-F0). In the convert stage, WaveGlow is modeled to generate speech, which is a streaming structure. To better verify the effectiveness of our proposed method, we also add some objective evaluation methods. These include Mel spectral distance, Object_MOS score, speaker similarity, and similarity in the embedding space of the entire speech. Experiments shows that our proposed concentrated attention method delivers comparable or better results than the previous foreign accent conversion method in terms of voice naturalness, speaker similarity to the source speaker, and accent similarity to the target speaker.
DOI:10.1109/ICKG55886.2022.00056