Foreign Accent Conversion using Concentrated Attention
Published in | 2022 IEEE International Conference on Knowledge Graph (ICKG) pp. 386 - 391 |
---|---|
Main Authors | , , |
Format | Conference Proceeding |
Language | English |
Published | IEEE 01.11.2022 |
Subjects | |
Summary: | Foreign accent conversion is an important and challenging problem because speakers from different regions differ significantly in manner of articulation and speech prosody. In this paper, we propose a new method for foreign accent conversion that uses Phonetic Posteriorgrams (PPGs) and log-scale fundamental frequency (Log-F0) to address phonetic and prosodic mismatches. Furthermore, we propose using concentrated attention to improve the alignment between input sequences and mel-spectrograms. Concentrated attention selects the k highest scores in each row of the attention matrix. In this way, the attention weight of the content related to the current sequence will be the largest. Our approach first trains a PPG extractor, an end-to-end hybrid CTC-attention model, on the LibriSpeech corpus. Then, a modified Tacotron2 based on concentrated attention is trained to model the relationship between PPGs and mel-spectrograms. In our proposed framework, the input to Tacotron2 is the concatenation of the PPG embedding and the normalized Log-F0. In the conversion stage, WaveGlow, which has a streaming structure, is used to generate speech. To better verify the effectiveness of our proposed method, we also add several objective evaluation metrics: Mel spectral distance, Object_MOS score, speaker similarity, and similarity in the embedding space of the entire speech. Experiments show that our proposed concentrated attention method delivers comparable or better results than the previous foreign accent conversion method in terms of voice naturalness, speaker similarity to the source speaker, and accent similarity to the target speaker. |
---|---|
DOI: | 10.1109/ICKG55886.2022.00056 |
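
The summary describes concentrated attention as a row-by-row selection of the top k scores in the attention matrix. The following is a minimal sketch of that idea, not the authors' implementation. It assumes that the unselected scores are masked out and that the k kept scores in each row are renormalized with a softmax; the record does not state either detail. The function name `concentrated_attention` and the plain list-of-lists representation are hypothetical and chosen only for illustration.

```python
import math
from typing import List


def concentrated_attention(scores: List[List[float]], k: int) -> List[List[float]]:
    """Row-wise top-k softmax (illustrative sketch, not the paper's code).

    For every non-empty row of raw attention scores, keep the k largest
    values, give all other positions zero weight, and normalize the kept
    values with a softmax so that each row sums to 1.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    weights = []
    for row in scores:
        kk = min(k, len(row))
        # Indices of the kk largest scores; ties go to the earlier position.
        order = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        keep = set(order[:kk])
        # Subtract the row maximum before exponentiating for numerical stability.
        row_max = max(row)
        exps = [math.exp(s - row_max) if i in keep else 0.0
                for i, s in enumerate(row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights


if __name__ == "__main__":
    # Each row would correspond to one decoder step, each column to one input frame.
    raw_scores = [[1.0, 3.0, 2.0, 0.5],
                  [0.0, 0.0, 5.0, 4.0]]
    for row in concentrated_attention(raw_scores, k=2):
        print([round(w, 3) for w in row])
```

With k equal to the row length this reduces to an ordinary softmax, so k acts as the knob that controls how sharply the attention weights concentrate on a few input frames.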