Cross-Speaker Style Transfer for Text-to-Speech Using Data Augmentation

Bibliographic Details
Published in ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 6797-6801
Main Authors Sam Ribeiro, Manuel; Roth, Julian; Comini, Giulia; Huybrechts, Goeric; Gabrys, Adam; Lorenzo-Trueba, Jaime
Format Conference Proceeding
Language English
Published IEEE 23.05.2022
Abstract We address the problem of cross-speaker style transfer for text-to-speech (TTS) using data augmentation via voice conversion. We assume access to a corpus of neutral, non-expressive data from a target speaker and supporting conversational, expressive data from different speakers. Our goal is to build a TTS system that is expressive while retaining the target speaker's identity. The proposed approach first uses voice conversion to generate high-quality data from the set of supporting expressive speakers. The voice-converted data is then pooled with natural data from the target speaker and used to train a single-speaker, multi-style TTS system. We provide evidence that this approach is efficient, flexible, and scalable. The method is evaluated using one or more supporting speakers, as well as a variable amount of supporting data. We further provide evidence that this approach allows some control over speaking style when multiple supporting speakers are used. We conclude by scaling the proposed technology to a set of 14 speakers across 7 languages. Results indicate that the technology consistently improves synthetic samples in terms of style similarity, while retaining the target speaker's identity.
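The data-pooling step described in the abstract can be illustrated with a minimal Python sketch. Everything named here is an assumption made for illustration and does not come from the paper: the `Utterance` record, the `build_training_pool` helper, and especially `placeholder_voice_conversion`, which only relabels the speaker and stands in for a real voice conversion model.

```python
from dataclasses import dataclass, replace
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Utterance:
    """Hypothetical record for one training utterance."""
    text: str
    speaker: str             # speaker identity heard in the audio
    style: str               # e.g. "neutral" or "conversational"
    synthetic: bool = False  # True if produced by voice conversion


def placeholder_voice_conversion(utt: Utterance, target_speaker: str) -> Utterance:
    """Stand-in for a voice conversion model (assumption, not the paper's model).

    A real system would transform the audio; this sketch only changes the
    speaker label and keeps the text and style label untouched.
    """
    return replace(utt, speaker=target_speaker, synthetic=True)


def build_training_pool(
    target_data: Iterable[Utterance],
    supporting_data: Iterable[Utterance],
    target_speaker: str,
    convert: Callable[[Utterance, str], Utterance] = placeholder_voice_conversion,
) -> List[Utterance]:
    """Pool natural target-speaker data with voice-converted supporting data."""
    converted = [convert(u, target_speaker) for u in supporting_data]
    return list(target_data) + converted


# Toy data: one neutral target utterance, two expressive supporting utterances.
target = [Utterance("Good morning.", "target", "neutral")]
supporting = [
    Utterance("Oh wow, really?", "support_a", "conversational"),
    Utterance("No way, tell me more!", "support_b", "conversational"),
]
pool = build_training_pool(target, supporting, target_speaker="target")
```

The resulting pool carries a single speaker identity but more than one style label, which is the shape of data a single-speaker, multi-style TTS model would be trained on under these assumptions.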
Author Affiliations
1. Sam Ribeiro, Manuel (manuerib@amazon.com), Amazon Alexa, TTS Research
2. Roth, Julian, Amazon Alexa, TTS Research
3. Comini, Giulia, Amazon Alexa, TTS Research
4. Huybrechts, Goeric (huybrech@amazon.com), Amazon Alexa, TTS Research
5. Gabrys, Adam, Amazon Alexa, TTS Research
6. Lorenzo-Trueba, Jaime (truebaj@amazon.com), Amazon Alexa, TTS Research
DOI 10.1109/ICASSP43922.2022.9746179
Discipline Engineering
EISBN 1665405406, 9781665405409
EISSN 2379-190X
Subjects Acoustics; Conferences; Controllability; cross-speaker; data augmentation; Signal processing; speaking style transfer; Speech processing; text-to-speech
URI https://ieeexplore.ieee.org/document/9746179