Synthesis of everyday conversational speech based on fine-tuning with a corpus for speech synthesis

Bibliographic Details
Published in: Acoustical Science and Technology, Vol. 46, No. 1, pp. 103–105
Main Authors: Mori, Hiroki; Furukawa, Kota
Format: Journal Article
Language: English
Published: Tokyo: Acoustical Society of Japan, 01.01.2025
Japan Science and Technology Agency
ISSN: 1346-3969
EISSN: 1347-5177
DOI: 10.1250/ast.e24.35

Abstract In this letter, we propose a separate modeling of prosodic and segmental features for everyday conversational speech synthesis, addressing challenges posed by low-quality recordings in the Corpus of Everyday Japanese Conversation (CEJC). Initially, the FastSpeech 2 model is trained on the conversation corpus and subsequently fine-tuned on a corpus for speech synthesis. Experimental results show that this fine-tuning approach enhances synthesis quality while preserving the nuances of everyday conversations.
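The two-stage recipe the abstract describes — pretrain on the conversation corpus, then fine-tune on a studio-quality corpus for speech synthesis — can be sketched with a toy model. Everything below is illustrative: the linear "model", the parameter names `w_pros`/`w_seg`, and the choice to freeze the prosody-related parameter during fine-tuning are assumptions standing in for FastSpeech 2's separate prosodic and segmental modules, not the authors' implementation.

```python
# Toy sketch (not the authors' code): stage 1 learns all parameters from
# noisy "conversational" data; stage 2 refines only the segmental
# parameter on clean "studio" data, so the conversational prosody model
# is preserved exactly.
import random

random.seed(0)

def mse_grads(params, data):
    """Gradients of mean squared error for y ≈ w_pros*x1 + w_seg*x2."""
    gp = gs = 0.0
    for x1, x2, y in data:
        err = params["w_pros"] * x1 + params["w_seg"] * x2 - y
        gp += 2 * err * x1 / len(data)
        gs += 2 * err * x2 / len(data)
    return {"w_pros": gp, "w_seg": gs}

def sgd_fit(params, data, lr=0.05, steps=200, frozen=()):
    """Plain gradient descent; parameter names in `frozen` are never updated."""
    for _ in range(steps):
        for name, g in mse_grads(params, data).items():
            if name not in frozen:
                params[name] -= lr * g
    return params

def truth(x1, x2):
    return 1.5 * x1 + 0.8 * x2

# Conversational recordings are noisy; the synthesis corpus is clean.
conversational = [(x1 := random.random(), x2 := random.random(),
                   truth(x1, x2) + random.gauss(0, 0.3)) for _ in range(200)]
studio = [(x1 := random.random(), x2 := random.random(), truth(x1, x2))
          for _ in range(50)]

params = {"w_pros": 0.0, "w_seg": 0.0}
sgd_fit(params, conversational)              # stage 1: pretrain everything
pros_before = params["w_pros"]
sgd_fit(params, studio, frozen=("w_pros",))  # stage 2: fine-tune segmental part
assert params["w_pros"] == pros_before       # prosody parameters untouched
```

Freezing one parameter group during fine-tuning is only one possible reading of "separate modeling"; the point of the sketch is that the fine-tuning stage can improve quality on clean data without overwriting what was learned from everyday conversation.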
ArticleNumber e24.35
Author Mori, Hiroki
Furukawa, Kota
Author_xml – sequence: 1
  fullname: Mori, Hiroki
  organization: School of Engineering, Utsunomiya University
– sequence: 2
  fullname: Furukawa, Kota
  organization: School of Engineering, Utsunomiya University
Cites_doi 10.1016/j.specom.2017.01.002
10.1527/tjsai.39-3_IDS6-B
10.1109/ACCESS.2022.3214977
10.23919/APSIPAASC55919.2022.9980105
10.1109/ICASSP40776.2020.9053795
ContentType Journal Article
Copyright 2025 by The Acoustical Society of Japan
2025. This work is published under https://creativecommons.org/licenses/by-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
DOI 10.1250/ast.e24.35
DatabaseName CrossRef
Electronics & Communications Abstracts
Linguistics and Language Behavior Abstracts (LLBA)
Solid State and Superconductivity Abstracts
Technology Research Database
Aerospace Database
Advanced Technologies Database with Aerospace

Discipline Physics
EISSN 1347-5177
EndPage 105
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 1
Language English
License https://creativecommons.org/licenses/by-nd/4.0
OpenAccessLink https://www.jstage.jst.go.jp/article/ast/46/1/46_e24.35/_article/-char/en
PageCount 3
PublicationDate 2025/01/01
PublicationPlace Tokyo
PublicationTitle Acoustical Science and Technology
PublicationYear 2025
Publisher ACOUSTICAL SOCIETY OF JAPAN
Japan Science and Technology Agency
References 1) T. Iizuka and H. Mori, "How does a spontaneously speaking conversational agent affect user behavior?" IEEE Access, 10, 111042–111051 (2022).
2) H. Mori and Y. Morimoto, "A listener-aware speech guidance that adaptively changes speech timing," J. Jpn. Soc. Artif. Intell., 39, 1–10 (2024), IDS6-B.
3) T. Nagata, H. Mori and T. Nose, "Dimensional paralinguistic information control based on multiple-regression HSMM for spontaneous dialogue speech synthesis with robust parameter estimation," Speech Commun., 88, 137–148 (2017).
4) H. Mori and H. Nishino, "Neural conversational speech synthesis with flexible control of emotion dimensions," Proc. 2022 APSIPA ASC, pp. 432–436 (2022).
5) H. Koiso, H. Amatani, Y. Den, Y. Iseki, Y. Ishimoto, W. Kashino, Y. Kawabata, K. Nishikawa, Y. Tanaka, Y. Usuda and Y. Watanabe, "Design and evaluation of the corpus of everyday Japanese conversation," Proc. 13th Language Resources and Evaluation Conf. (LREC) 2022, pp. 5587–5594 (2022).
6) Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," Proc. Int. Conf. Learning Representations (ICLR) 2021 (2021).
7) R. Sonobe, S. Takamichi and H. Saruwatari, "JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv:1711.00354 (2017).
8) R. Yamamoto, E. Song and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) 2020, pp. 6199–6203 (2020).
StartPage 103
SubjectTerms Conversation
Conversational agent
Corpus linguistics
Everyday conversation
Japanese language
Linguistics
Prosody
Speech recognition
Speech synthesis
Title Synthesis of everyday conversational speech based on fine-tuning with a corpus for speech synthesis
URI https://www.jstage.jst.go.jp/article/ast/46/1/46_e24.35/_article/-char/en
https://www.proquest.com/docview/3177358982
Volume 46