Synthesis of everyday conversational speech based on fine-tuning with a corpus for speech synthesis
In this letter, we propose a separate modeling of prosodic and segmental features for everyday conversational speech synthesis, addressing challenges posed by low-quality recordings in the Corpus of Everyday Japanese Conversation (CEJC). Initially, the FastSpeech 2 model is trained on the conversation corpus and subsequently fine-tuned on a corpus for speech synthesis. Experimental results show that this fine-tuning approach enhances synthesis quality while preserving the nuances of everyday conversations.
Published in | Acoustical Science and Technology, Vol. 46, No. 1, pp. 103–105 |
Main Authors | Mori, Hiroki; Furukawa, Kota |
Format | Journal Article |
Language | English |
Published | Tokyo: ACOUSTICAL SOCIETY OF JAPAN / Japan Science and Technology Agency, 01.01.2025 |
Subjects | Conversation; Conversational agent; Corpus linguistics; Everyday conversation; Japanese language; Linguistics; Prosody; Speech recognition; Speech synthesis |
ISSN | 1346-3969 (print); 1347-5177 (online) |
DOI | 10.1250/ast.e24.35 |
Abstract | In this letter, we propose a separate modeling of prosodic and segmental features for everyday conversational speech synthesis, addressing challenges posed by low-quality recordings in the Corpus of Everyday Japanese Conversation (CEJC). Initially, the FastSpeech 2 model is trained on the conversation corpus and subsequently fine-tuned on a corpus for speech synthesis. Experimental results show that this fine-tuning approach enhances synthesis quality while preserving the nuances of everyday conversations. |
ArticleNumber | e24.35 |
Author | Mori, Hiroki; Furukawa, Kota |
Author_xml | – sequence: 1 fullname: Mori, Hiroki organization: School of Engineering, Utsunomiya University – sequence: 2 fullname: Furukawa, Kota organization: School of Engineering, Utsunomiya University |
Cites_doi | 10.1016/j.specom.2017.01.002 10.1527/tjsai.39-3_IDS6-B 10.1109/ACCESS.2022.3214977 10.23919/APSIPAASC55919.2022.9980105 10.1109/ICASSP40776.2020.9053795 |
ContentType | Journal Article |
Copyright | 2025 by The Acoustical Society of Japan 2025. This work is published under https://creativecommons.org/licenses/by-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. |
DOI | 10.1250/ast.e24.35 |
DatabaseName | CrossRef Electronics & Communications Abstracts Linguistics and Language Behavior Abstracts (LLBA) Solid State and Superconductivity Abstracts Technology Research Database Aerospace Database Advanced Technologies Database with Aerospace |
Discipline | Physics |
EISSN | 1347-5177 |
EndPage | 105 |
ISSN | 1346-3969 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 1 |
Language | English |
License | https://creativecommons.org/licenses/by-nd/4.0 |
OpenAccessLink | https://www.jstage.jst.go.jp/article/ast/46/1/46_e24.35/_article/-char/en |
PageCount | 3 |
PublicationCentury | 2000 |
PublicationDate | 2025/01/01 |
PublicationDecade | 2020 |
PublicationPlace | Tokyo |
PublicationTitle | Acoustical Science and Technology |
PublicationYear | 2025 |
Publisher | ACOUSTICAL SOCIETY OF JAPAN Japan Science and Technology Agency |
References |
1) T. Iizuka and H. Mori, "How does a spontaneously speaking conversational agent affect user behavior?" IEEE Access, 10, 111042–111051 (2022).
2) H. Mori and Y. Morimoto, "A listener-aware speech guidance that adaptively changes speech timing," J. Jpn. Soc. Artif. Intell., 39, 1–10 (2024), IDS6-B.
3) T. Nagata, H. Mori and T. Nose, "Dimensional paralinguistic information control based on multiple-regression HSMM for spontaneous dialogue speech synthesis with robust parameter estimation," Speech Commun., 88, 137–148 (2017).
4) H. Mori and H. Nishino, "Neural conversational speech synthesis with flexible control of emotion dimensions," Proc. 2022 APSIPA ASC, pp. 432–436 (2022).
5) H. Koiso, H. Amatani, Y. Den, Y. Iseki, Y. Ishimoto, W. Kashino, Y. Kawabata, K. Nishikawa, Y. Tanaka, Y. Usuda and Y. Watanabe, "Design and evaluation of the corpus of everyday Japanese conversation," Proc. 13th Language Resources and Evaluation Conf. (LREC) 2022, pp. 5587–5594 (2022).
6) Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao and T.-Y. Liu, "FastSpeech 2: Fast and high-quality end-to-end text to speech," Proc. Int. Conf. Learning Representations (ICLR) 2021 (2021).
7) R. Sonobe, S. Takamichi and H. Saruwatari, "JSUT corpus: Free large-scale Japanese speech corpus for end-to-end speech synthesis," arXiv:1711.00354 (2017).
8) R. Yamamoto, E. Song and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," Proc. IEEE Int. Conf. Acoust. Speech Signal Process. (ICASSP) 2020, pp. 6199–6203 (2020). |
StartPage | 103 |
SubjectTerms | Conversation; Conversational agent; Corpus linguistics; Everyday conversation; Japanese language; Linguistics; Prosody; Speech recognition; Speech synthesis |
Title | Synthesis of everyday conversational speech based on fine-tuning with a corpus for speech synthesis |
URI | https://www.jstage.jst.go.jp/article/ast/46/1/46_e24.35/_article/-char/en https://www.proquest.com/docview/3177358982 |
Volume | 46 |