VOICE QUALITY CONVERSION DEVICE, VOICE QUALITY CONVERSION METHOD, VOICE QUALITY CONVERSION NEURAL NETWORK, PROGRAM, AND RECORDING MEDIUM



Bibliographic Details
Main Authors	FUJITA Kazuki, HIROSHIBA Kazuyuki, KITAOKA Shinya
Format	Patent
Language	English, French, Japanese
Published	01.02.2024
Subjects	ACOUSTICS; MUSICAL INSTRUMENTS; PHYSICS; SPEECH ANALYSIS OR SYNTHESIS; SPEECH OR AUDIO CODING OR DECODING; SPEECH OR VOICE PROCESSING; SPEECH RECOGNITION

Summary:	A voice quality conversion device 1 comprises an input unit 11 that receives source speech and speaker information, and a conversion unit 12 that uses a trained neural network 100 to convert the voice quality of the source speech into speech matching the conversion-destination speaker information. The neural network 100 comprises: an encoder 110 that receives speech and outputs a latent expression S1; a flow 120 that converts the latent expression S1 into a speaker-independent latent expression, from which the characteristics of the source speaker have been removed while the features of the manner of utterance are preserved, and that reverse-converts the speaker-independent latent expression into a latent expression S2 by adding the characteristics of the conversion-destination speaker; and a vocoder 130 that receives the latent expression S2 and outputs the conversion-destination speech. The voice quality conversion device 1 further comprises a training unit 13 that trains the neural network 100 such that the vocoder 130 can restore the latent expression output by the encoder 110 to the original training speech, and such that the speaker-independent latent expression obtained through conversion by the flow 120 and an expression created from speaker-independent information output by a text encoder 140 become closer to each other.
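Illustration:	The flow step and the two training goals described in the summary can be sketched roughly as follows. This is a toy under stated assumptions, not the patented implementation: the summary does not say what form the flow 120 takes, so an element-wise affine transform conditioned on hypothetical per-speaker parameters (SpeakerStats) stands in for it, and mean squared error stands in for both training criteria. All function and class names are invented for this sketch.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass
class SpeakerStats:
    """Hypothetical per-speaker flow parameters (an assumption, not from the record)."""
    shift: Vector
    log_scale: Vector


def flow_forward(s1: Vector, src: SpeakerStats) -> Vector:
    """Map latent S1 to a speaker-independent latent by undoing the source speaker's affine map."""
    return [(x - m) * math.exp(-ls) for x, m, ls in zip(s1, src.shift, src.log_scale)]


def flow_inverse(z: Vector, tgt: SpeakerStats) -> Vector:
    """Map a speaker-independent latent to S2 by applying the destination speaker's affine map."""
    return [x * math.exp(ls) + m for x, m, ls in zip(z, tgt.shift, tgt.log_scale)]


def convert_latent(s1: Vector, src: SpeakerStats, tgt: SpeakerStats) -> Vector:
    """Forward pass with the source speaker, then reverse pass with the destination speaker."""
    return flow_inverse(flow_forward(s1, src), tgt)


def mean_squared_error(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def training_loss(original: Vector, restored: Vector, z: Vector, text_expr: Vector) -> float:
    """Sum of the two goals in the summary, both modelled here as MSE (an assumption):
    (1) the vocoder output should match the original training speech;
    (2) the speaker-independent latent should approach the text-encoder expression."""
    return mean_squared_error(original, restored) + mean_squared_error(z, text_expr)


if __name__ == "__main__":
    src = SpeakerStats(shift=[1.0, -2.0], log_scale=[0.0, math.log(2.0)])
    tgt = SpeakerStats(shift=[0.5, 0.5], log_scale=[math.log(3.0), 0.0])
    s1 = [2.0, 0.0]
    print("speaker-independent latent:", flow_forward(s1, src))
    print("latent S2:", convert_latent(s1, src, tgt))
```

Because the transform is invertible, converting with the same speaker as both source and destination returns the original latent, which matches the role the summary gives the flow: speaker characteristics are removed and re-added without losing the manner of utterance. The encoder 110, vocoder 130, and text encoder 140 are left out of the sketch; their outputs appear only as plain vectors.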
Bibliography:	Application Number: WO2023JP27485