Cross-Speaker Style Transfer for Text-to-Speech Using Data Augmentation
Published in | ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6797-6801
Main Authors | Sam Ribeiro, Manuel; Roth, Julian; Comini, Giulia; Huybrechts, Goeric; Gabrys, Adam; Lorenzo-Trueba, Jaime
Format | Conference Proceeding
Language | English
Published | IEEE, 23.05.2022
Abstract | We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume to have a corpus of neutral non-expressive data from a target speaker and supporting conversational expressive data from different speakers. Our goal is to build a TTS system that is expressive, while retaining the target speaker's identity. The proposed approach relies on voice conversion to first generate high-quality data from the set of supporting expressive speakers. The voice converted data is then pooled with natural data from the target speaker and used to train a single-speaker multi-style TTS system. We provide evidence that this approach is efficient, flexible, and scalable. The method is evaluated using one or more supporting speakers, as well as a variable amount of supporting data. We further provide evidence that this approach allows some controllability of speaking style, when using multiple supporting speakers. We conclude by scaling our proposed technology to a set of 14 speakers across 7 languages. Results indicate that our technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker's identity. |
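The abstract describes a simple pooling recipe: expressive recordings from supporting speakers are first voice-converted into the target speaker's voice, then mixed with the target speaker's natural neutral data to train a single-speaker, multi-style TTS model. A minimal sketch of that pooling step is given below; Utterance, convert_voice, and build_training_pool are hypothetical names used for illustration, not the paper's implementation.

```python
# Minimal sketch of the data-augmentation recipe summarized in the abstract.
# convert_voice() stands in for a voice-conversion model; the real system
# described in the paper is not reproduced here.

from dataclasses import dataclass, replace
from typing import Iterable, List


@dataclass(frozen=True)
class Utterance:
    audio_path: str
    text: str
    speaker: str
    style: str  # e.g. "neutral" or "conversational"


def convert_voice(utt: Utterance, target_speaker: str) -> Utterance:
    """Hypothetical voice-conversion step: map a supporting speaker's
    expressive recording into the target speaker's voice, keeping the style."""
    converted_path = utt.audio_path.replace(".wav", f"_{target_speaker}_vc.wav")
    # ... a real pipeline would run the VC model and write converted_path ...
    return replace(utt, audio_path=converted_path, speaker=target_speaker)


def build_training_pool(
    target_neutral: Iterable[Utterance],
    supporting_expressive: Iterable[Utterance],
    target_speaker: str,
) -> List[Utterance]:
    """Pool natural neutral target-speaker data with voice-converted expressive
    data from one or more supporting speakers; the pooled corpus then trains a
    single-speaker, multi-style TTS model."""
    pool = list(target_neutral)
    pool.extend(convert_voice(u, target_speaker) for u in supporting_expressive)
    return pool
```

Per the abstract, the same pooling can draw on one or several supporting speakers and on varying amounts of supporting data, which is what makes the recipe flexible and scalable.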
Author | Sam Ribeiro, Manuel (manuerib@amazon.com); Roth, Julian; Comini, Giulia; Huybrechts, Goeric (huybrech@amazon.com); Gabrys, Adam; Lorenzo-Trueba, Jaime (truebaj@amazon.com); all authors: Amazon Alexa, TTS Research
CitedBy | doi:10.3389/fcomm.2023.1089577
ContentType | Conference Proceeding |
DOI | 10.1109/ICASSP43922.2022.9746179 |
Discipline | Engineering |
EISBN | 1665405406; 9781665405409
EISSN | 2379-190X |
EndPage | 6801 |
ExternalDocumentID | 9746179 |
Genre | orig-research |
IsPeerReviewed | false |
IsScholarly | true |
Language | English |
PageCount | 5 |
PublicationDate | 2022-May-23 |
PublicationTitle | ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) |
PublicationTitleAbbrev | ICASSP |
PublicationYear | 2022 |
Publisher | IEEE |
StartPage | 6797 |
SubjectTerms | Acoustics Conferences Controllability cross-speaker data augmentation Signal processing speaking style transfer Speech processing text-to-speech |
Title | Cross-Speaker Style Transfer for Text-to-Speech Using Data Augmentation |
URI | https://ieeexplore.ieee.org/document/9746179 |