Voice conversion using duration-embedded bi-HMMs for expressive speech synthesis

This paper presents an expressive voice conversion model (DeBi-HMM) as the post processing of a text-to-speech (TTS) system for expressive speech synthesis. DeBi-HMM is named for its duration-embedded characteristic of the two HMMs for modeling the source and target speech signals, respectively. Joi...

Bibliographic Details
Published in	IEEE Transactions on Audio, Speech, and Language Processing, Vol. 14, no. 4, pp. 1109-1116
Main Authors	Wu, Chung-Hsien, Hsia, Chi-Chun, Liu, Te-Hsien, Wang, Jhing-Fa
Format	Journal Article
Language	English
Published	IEEE 01.07.2006
Subjects	Algorithm design and analysis; Bi-HMM voice conversion; Computer science; Conversion; Decision trees; embedded duration model; expressive speech synthesis; Hidden Markov models; Humans; Mathematical models; Natural language processing; prosody conversion; Signal synthesis; Spatial databases; Speech; Speech analysis; Speech recognition; Speech synthesis; Testing; Trains; Voice
Abstract	This paper presents an expressive voice conversion model (DeBi-HMM) as a post-processing stage of a text-to-speech (TTS) system for expressive speech synthesis. DeBi-HMM is named for the duration-embedded characteristic of its two HMMs, which model the source and target speech signals, respectively. Joint estimation of the source and target HMMs is exploited for spectrum conversion from neutral to expressive speech. A gamma distribution is embedded as the duration model for each state in the source and target HMMs. Expressive style-dependent decision trees perform prosodic conversion. The STRAIGHT algorithm is adopted for the analysis and synthesis process. A set of small-sized speech databases, one for each expressive style, is designed and collected to train the DeBi-HMM voice conversion models. Several experiments with statistical hypothesis testing are conducted to evaluate the quality of the synthetic speech as perceived by human subjects. Compared with previous voice conversion methods, the proposed method exhibits encouraging potential in expressive speech synthesis.
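Note	The abstract states that a gamma distribution serves as the duration model for each HMM state. The sketch below only illustrates that idea and is not the paper's estimation procedure: it fits a gamma distribution to a handful of state durations by the method of moments, then scores candidate durations. The function names, the sample durations (in frames), and the choice of moment matching are all assumptions made for the example.

```python
import math
import statistics


def fit_gamma_moments(durations):
    """Fit gamma shape/scale by moment matching (illustrative, not the paper's estimator)."""
    mean = statistics.fmean(durations)
    var = statistics.pvariance(durations)
    if mean <= 0 or var <= 0:
        raise ValueError("durations must be positive and not all identical")
    shape = mean * mean / var   # k = mean^2 / variance
    scale = var / mean          # theta = variance / mean
    return shape, scale


def gamma_log_pdf(d, shape, scale):
    """Log density of a gamma(shape, scale) distribution at duration d > 0."""
    return ((shape - 1.0) * math.log(d)
            - d / scale
            - math.lgamma(shape)
            - shape * math.log(scale))


# Hypothetical durations (in frames) observed for one HMM state.
state_durations = [2, 4, 4, 4, 5, 5, 7, 9]
shape, scale = fit_gamma_moments(state_durations)

# A typical duration should score higher than an implausibly long one.
typical_score = gamma_log_pdf(5, shape, scale)
long_score = gamma_log_pdf(20, shape, scale)
```

In a duration-embedded HMM, a per-state log density of this kind would be combined with the state's spectral likelihood during decoding. How DeBi-HMM combines them is described in the full text, not in this record.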
CODEN	ITASD8
DOI	10.1109/TASL.2006.876112
Discipline	Engineering Computer Science
EISSN	1558-7924
Genre	orig-research
ISSN	1558-7916
IsPeerReviewed	true
IsScholarly	true
PageCount	8
PublicationTitleAbbrev	TASL