LEARNING DEVICE AND PROGRAM FOR LEARNING STATISTICAL MODEL USED FOR VOICE SYNTHESIS

Bibliographic Details
Main Authors	TSUGI TORU, KUMANO TADASHI, KURIHARA KIYOSHI, IMAI ATSUSHI, SEIYAMA NOBUMASA
Format	Patent
Language	English, Japanese
Published	26.03.2020
Subjects	ACOUSTICS; MUSICAL INSTRUMENTS; PHYSICS; SPEECH ANALYSIS OR SYNTHESIS; SPEECH OR AUDIO CODING OR DECODING; SPEECH OR VOICE PROCESSING; SPEECH RECOGNITION
Summary:	PROBLEM: To generate a statistical model that yields synthesized voice signals of stable quality when learning is performed using a language feature amount that includes pause information. SOLUTION: An associating unit 13 of a learning device 1 temporally associates the language feature amount for each phoneme with the acoustic feature amount for each frame. A pause changing unit 14 changes the pause length indicated by the pause information included in the temporally associated language feature amount, and generates a pause-changed language feature amount that reflects the changed pause length. The pause changing unit 14 also generates a pause-changed acoustic feature amount that reflects the changed pause length. A learning unit 15 learns a statistical model, treating as learning data both the set of the language feature amount and the acoustic feature amount temporally associated by the associating unit 13 and the set of the pause-changed language feature amount and the pause-changed acoustic feature amount generated by the pause changing unit 14. SELECTED DRAWING: Figure 1
Bibliography:	Application Number: JP20180175221
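
The pause-changing step in the summary can be read as a form of data augmentation. The following is a minimal sketch of that idea, not the patented implementation. The names `Segment`, `PAUSE`, `change_pause`, `augment` and `build_training_set` are hypothetical, as are the scale factors. The summary does not say how the pause-changed acoustic feature amount is produced, so nearest-neighbour resampling of the pause frames is assumed here purely for illustration.

```python
from dataclasses import dataclass, replace
from typing import List, Sequence

PAUSE = "pau"  # hypothetical label marking a pause segment


@dataclass(frozen=True)
class Segment:
    """One phoneme (or pause) after temporal association with its frames."""
    phoneme: str               # phoneme label, or PAUSE
    pause_len: int             # pause information: length in frames (0 if not a pause)
    frames: List[List[float]]  # acoustic feature vectors aligned to this segment


def change_pause(seg: Segment, new_len: int) -> Segment:
    """Return a copy of a pause segment resized to new_len frames."""
    old_len = len(seg.frames)
    # Assumption: stretch or shrink the pause by nearest-neighbour resampling.
    frames = [seg.frames[i * old_len // new_len] for i in range(new_len)]
    # The language side (pause_len) and the acoustic side (frames) change together.
    return replace(seg, pause_len=new_len, frames=frames)


def augment(utterance: Sequence[Segment], scale: float) -> List[Segment]:
    """Scale every pause in an aligned utterance; other segments are kept as-is."""
    out = []
    for seg in utterance:
        if seg.phoneme == PAUSE and seg.frames:
            new_len = max(1, round(len(seg.frames) * scale))
            out.append(change_pause(seg, new_len))
        else:
            out.append(seg)
    return out


def build_training_set(utterance, scales=(0.5, 2.0)):
    """Original aligned data plus one pause-changed copy per scale factor."""
    return [utterance] + [augment(utterance, s) for s in scales]


utt = [
    Segment("a", 0, [[1.0]] * 3),
    Segment(PAUSE, 4, [[0.0], [0.1], [0.2], [0.3]]),
    Segment("b", 0, [[2.0]] * 2),
]
data = build_training_set(utt)
for variant in data:
    print([(s.phoneme, s.pause_len, len(s.frames)) for s in variant])
```

Each variant keeps the phoneme sequence unchanged and differs only in the pause segment, so a model trained on all of them sees the same text with several pause lengths.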