SVSNet: An End-to-end Speaker Voice Similarity Assessment Model
Published in | arXiv.org |
---|---|
Main Authors | Hu, Cheng-Hung; Peng, Yu-Huai; Yamagishi, Junichi; Tsao, Yu; Wang, Hsin-Min |
Format | Paper Journal Article |
Language | English |
Published | Ithaca: Cornell University Library, arXiv.org, 17.02.2022 |
Abstract | Neural evaluation metrics derived for numerous speech generation tasks have recently attracted great attention. In this paper, we propose SVSNet, the first end-to-end neural network model to assess the speaker voice similarity between converted speech and natural speech for voice conversion tasks. Unlike most neural evaluation metrics that use hand-crafted features, SVSNet directly takes the raw waveform as input to more completely utilize speech information for prediction. SVSNet consists of encoder, co-attention, distance calculation, and prediction modules and is trained in an end-to-end manner. The experimental results on the Voice Conversion Challenge 2018 and 2020 (VCC2018 and VCC2020) datasets show that SVSNet outperforms well-known baseline systems in the assessment of speaker similarity at the utterance and system levels. |
---|---|
Author | Wang, Hsin-Min; Yamagishi, Junichi; Hu, Cheng-Hung; Peng, Yu-Huai; Tsao, Yu |
Author_xml | – sequence: 1 givenname: Hu surname: Cheng-Hung fullname: Cheng-Hung, Hu – sequence: 2 fullname: Yu-Huai Peng – sequence: 3 givenname: Junichi surname: Yamagishi fullname: Yamagishi, Junichi – sequence: 4 givenname: Yu surname: Tsao fullname: Tsao, Yu – sequence: 5 givenname: Hsin-Min surname: Wang fullname: Wang, Hsin-Min |
BackLink | https://doi.org/10.1109/LSP.2022.3152672 (published paper; access to full text may be restricted); https://doi.org/10.48550/arXiv.2107.09392 (paper in arXiv) |
ContentType | Paper Journal Article |
Copyright | 2022. This work is published under http://arxiv.org/licenses/nonexclusive-distrib/1.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License. http://arxiv.org/licenses/nonexclusive-distrib/1.0 |
DOI | 10.48550/arxiv.2107.09392 |
Discipline | Physics |
EISSN | 2331-8422 |
ExternalDocumentID | 2107_09392 |
Genre | Working Paper/Pre-Print |
IngestDate | Mon Jan 08 05:48:18 EST 2024 Thu Oct 10 18:37:59 EDT 2024 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
PublicationDate | 2022-02-17 |
PublicationPlace | Ithaca |
PublicationTitle | arXiv.org |
PublicationYear | 2022 |
Publisher | Cornell University Library, arXiv.org |
SecondaryResourceType | preprint |
SourceID | arxiv proquest |
SourceType | Open Access Repository Aggregation Database |
SubjectTerms | Coders; Computer Science - Learning; Computer Science - Sound; Evaluation; Neural networks; Similarity; Speech recognition; Waveforms |
Title | SVSNet: An End-to-end Speaker Voice Similarity Assessment Model |
URI | https://www.proquest.com/docview/2553631931; https://arxiv.org/abs/2107.09392 |
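
The abstract reports results "at the utterance and system levels" without naming the metric in this record. As a rough illustration only, the sketch below assumes Pearson linear correlation between predicted and human similarity scores. It computes the correlation once over individual utterances and once over per-system mean scores. The function names (`pearson`, `utterance_level`, `system_level`) and the toy records are hypothetical. None of them come from the paper or from the VCC2018/VCC2020 data.

```python
from collections import defaultdict
from math import sqrt


def pearson(xs, ys):
    """Pearson linear correlation between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def utterance_level(records):
    """Correlate predicted and human scores over individual utterances."""
    return pearson([r[1] for r in records], [r[2] for r in records])


def system_level(records):
    """Average scores per system first, then correlate the system means."""
    by_system = defaultdict(list)
    for system_id, predicted, human in records:
        by_system[system_id].append((predicted, human))
    pred_means, human_means = [], []
    for system_id in sorted(by_system):
        pairs = by_system[system_id]
        pred_means.append(sum(p for p, _ in pairs) / len(pairs))
        human_means.append(sum(h for _, h in pairs) / len(pairs))
    return pearson(pred_means, human_means)


# Hypothetical toy data (assumption, not from the paper):
# each record is (system_id, predicted_score, human_score).
toy_records = [
    ("sysA", 1.0, 2.0), ("sysA", 3.0, 2.0),
    ("sysB", 2.0, 3.0), ("sysB", 4.0, 3.0),
    ("sysC", 3.0, 4.0), ("sysC", 5.0, 4.0),
]

if __name__ == "__main__":
    print("utterance-level r:", utterance_level(toy_records))
    print("system-level r:", system_level(toy_records))
```

On this toy data the per-utterance prediction errors cancel within each system, so the system-level correlation comes out higher than the utterance-level one. This is one reason the two levels are usually reported separately.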