Synthesis of everyday conversational speech based on fine-tuning with a corpus for speech synthesis
In this letter, we propose a separate modeling of prosodic and segmental features for everyday conversational speech synthesis, addressing challenges posed by low-quality recordings in the Corpus of Everyday Japanese Conversation (CEJC). Initially, the FastSpeech 2 model is trained on the conversation corpus and subsequently fine-tuned on a corpus for speech synthesis. Experimental results show that this fine-tuning approach enhances synthesis quality while preserving the nuances of everyday conversations.
Published in | Acoustical Science and Technology, Vol. 46, No. 1, pp. 103–105 |
Main Authors | Mori, Hiroki; Furukawa, Kota |
Format | Journal Article |
Language | English |
Published | Tokyo: ACOUSTICAL SOCIETY OF JAPAN / Japan Science and Technology Agency, 01.01.2025 |
Subjects | Conversation; Conversational agent; Corpus linguistics; Everyday conversation; Japanese language; Linguistics; Prosody; Speech recognition; Speech synthesis |
ISSN | 1346-3969 (print); 1347-5177 (online) |
DOI | 10.1250/ast.e24.35 |
Abstract | In this letter, we propose a separate modeling of prosodic and segmental features for everyday conversational speech synthesis, addressing challenges posed by low-quality recordings in the Corpus of Everyday Japanese Conversation (CEJC). Initially, the FastSpeech 2 model is trained on the conversation corpus and subsequently fine-tuned on a corpus for speech synthesis. Experimental results show that this fine-tuning approach enhances synthesis quality while preserving the nuances of everyday conversations. |
ArticleNumber | e24.35 |
Author | Mori, Hiroki; Furukawa, Kota |
Author_xml | – sequence: 1 fullname: Mori, Hiroki organization: School of Engineering, Utsunomiya University – sequence: 2 fullname: Furukawa, Kota organization: School of Engineering, Utsunomiya University |
Cites_doi | 10.1016/j.specom.2017.01.002 10.1527/tjsai.39-3_IDS6-B 10.1109/ACCESS.2022.3214977 10.23919/APSIPAASC55919.2022.9980105 10.1109/ICASSP40776.2020.9053795 |
ContentType | Journal Article |
Copyright | 2025 by The Acoustical Society of Japan 2025. This work is published under https://creativecommons.org/licenses/by-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
DOI | 10.1250/ast.e24.35 |
DatabaseName | CrossRef Electronics & Communications Abstracts Linguistics and Language Behavior Abstracts (LLBA) Solid State and Superconductivity Abstracts Technology Research Database Aerospace Database Advanced Technologies Database with Aerospace |
Discipline | Physics |
EISSN | 1347-5177 |
EndPage | 105 |
ISSN | 1346-3969 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 1 |
Language | English |
License | https://creativecommons.org/licenses/by-nd/4.0 |
OpenAccessLink | https://www.jstage.jst.go.jp/article/ast/46/1/46_e24.35/_article/-char/en |
PageCount | 3 |
PublicationCentury | 2000 |
PublicationDate | 2025/01/01 |
PublicationDecade | 2020 |
PublicationPlace | Tokyo |
PublicationTitle | Acoustical Science and Technology |
PublicationYear | 2025 |
Publisher | ACOUSTICAL SOCIETY OF JAPAN Japan Science and Technology Agency |
References |
1) T. Iizuka and H. Mori, "How does a spontaneously speaking conversational agent affect user behavior?" IEEE Access, 10, 111042–111051 (2022).
2) H. Mori and Y. Morimoto, "A listener-aware speech guidance that adaptively changes speech timing," J. Jpn. Soc. Artif. Intell., 39, 1–10 (2024), IDS6-B.
3) T. Nagata, H. Mori and T. Nose, "Dimensional paralinguistic information control based on multiple-regression HSMM for spontaneous dialogue speech synthesis with robust parameter estimation," Speech Commun., 88, 137–148 (2017).
4) H. Mori and H. Nishino, "Neural conversational speech synthesis with flexible control of emotion dimensions," Proc. 2022 APSIPA ASC, pp. 432–436 (2022).
5) H. Koiso, H. Amatani, Y. Den, Y. Iseki, Y. Ishimoto, W. Kashino, Y. Kawabata, K. Nishikawa, Y. Tanaka, Y. Usuda and Y. Watanabe, "Design and evaluation of the corpus of everyday Japanese conversation," Proc. 13th Language Resources and Evaluation Conf. (LREC) 2022, pp. 5587–5594 (2022).
6) Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," Proc. Int. Conf. Learning Representations (ICLR) 2021 (2021).
7) R. Sonobe, S. Takamichi and H. Saruwatari, "JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv:1711.00354 (2017).
8) R. Yamamoto, E. Song and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) 2020, pp. 6199–6203 (2020). |
StartPage | 103 |
SubjectTerms | Conversation; Conversational agent; Corpus linguistics; Everyday conversation; Japanese language; Linguistics; Prosody; Speech recognition; Speech synthesis |
Title | Synthesis of everyday conversational speech based on fine-tuning with a corpus for speech synthesis |
URI | https://www.jstage.jst.go.jp/article/ast/46/1/46_e24.35/_article/-char/en https://www.proquest.com/docview/3177358982 |
Volume | 46 |