Speaker Recognition Based on Pre-Trained Model and Deep Clustering
In this paper, we propose a novel loss by integrating a deep clustering (DC) loss at the frame-level and a speaker recognition loss at the segment-level into a single network without additional data requirements and exhaustive computation. The DC loss implicitly generates soft pseudo-phoneme labels...
Published in | Proceedings (IEEE International Conference on Multimedia and Expo) pp. 1 - 6 |
---|---|
Main Authors | He, Liang; Song, Zhida; Liu, Shuanghong; Niu, Mengqi; Hu, Ying; Huang, Hao |
Format | Conference Proceeding |
Language | English |
Published | IEEE, 15.07.2024 |
ISSN | 1945-788X |
DOI | 10.1109/ICME57554.2024.10687367 |
Abstract | In this paper, we propose a novel loss that integrates a deep clustering (DC) loss at the frame level and a speaker recognition loss at the segment level into a single network, without additional data requirements or exhaustive computation. The DC loss implicitly generates soft pseudo-phoneme labels for each frame-level feature, which facilitates extracting a more discriminative speaker representation by suppressing phonetic content information. We study the DC loss not only on the acoustic feature, but also on features extracted by pre-trained models such as wav2vec 2.0, HuBERT and WavLM. Experimental results on the VoxCeleb dataset show that the overall performance of systems based on pre-trained model features is better than that of systems based on the acoustic feature. The proposed loss is significantly effective for systems based on the acoustic feature and yields a marginal improvement for systems based on pre-trained model features. |
---|---|
Author | Niu, Mengqi; Huang, Hao; Hu, Ying; Liu, Shuanghong; Song, Zhida; He, Liang |
Author_xml | 1. He, Liang (Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017); 2. Song, Zhida (Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017); 3. Liu, Shuanghong (Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017); 4. Niu, Mengqi (Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017); 5. Hu, Ying (Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017); 6. Huang, Hao (Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017) |
ContentType | Conference Proceeding |
DOI | 10.1109/ICME57554.2024.10687367 |
Discipline | Computer Science |
EISBN | 9798350390155 |
EISSN | 1945-788X |
EndPage | 6 |
ExternalDocumentID | 10687367 |
Genre | orig-research |
GrantInformation_xml | – fundername: National Natural Science Foundation of China funderid: 10.13039/501100001809 |
IsPeerReviewed | false |
IsScholarly | false |
Language | English |
PageCount | 6 |
PublicationCentury | 2000 |
PublicationDate | 2024-July-15 |
PublicationDateYYYYMMDD | 2024-07-15 |
PublicationDate_xml | – month: 07 year: 2024 text: 2024-July-15 day: 15 |
PublicationDecade | 2020 |
PublicationTitle | Proceedings (IEEE International Conference on Multimedia and Expo) |
PublicationTitleAbbrev | ICME |
PublicationYear | 2024 |
Publisher | IEEE |
Publisher_xml | – name: IEEE |
StartPage | 1 |
SubjectTerms | Computational modeling; Costs; deep clustering loss; Feature extraction; multi-task learning; Multimedia databases; Phonetics; pseudo-phoneme labels; self-supervised learning; speaker recognition; System performance; Training |
Title | Speaker Recognition Based on Pre-Trained Model and Deep Clustering |
URI | https://ieeexplore.ieee.org/document/10687367 |
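
The abstract describes a single objective that combines a frame-level deep clustering (DC) loss with a segment-level speaker recognition loss. The record does not give the paper's formulas, so the following is only a minimal sketch of how two such terms might be combined. Everything in it is an assumption made for illustration: the soft k-means form of the DC term, mean pooling followed by a linear classifier for the speaker term, and the hypothetical names `soft_pseudo_labels`, `dc_loss`, `speaker_loss`, `joint_loss` and `dc_weight`. It does not reproduce the authors' method.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_pseudo_labels(frame, centroids):
    """Soft cluster posteriors for one frame (stand-in for soft pseudo-phoneme labels)."""
    return softmax([-sq_dist(frame, c) for c in centroids])

def dc_loss(frames, centroids):
    """Frame-level term: soft k-means distortion averaged over frames (assumed form)."""
    total = 0.0
    for f in frames:
        q = soft_pseudo_labels(f, centroids)
        total += sum(qk * sq_dist(f, c) for qk, c in zip(q, centroids))
    return total / len(frames)

def speaker_loss(frames, weights, speaker_id):
    """Segment-level term: mean-pool frames, apply a linear classifier, take cross-entropy."""
    dim = len(frames[0])
    pooled = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    logits = [sum(w * x for w, x in zip(row, pooled)) for row in weights]
    return -math.log(softmax(logits)[speaker_id])

def joint_loss(frames, centroids, weights, speaker_id, dc_weight=0.1):
    """Weighted sum of the two terms; dc_weight is a hypothetical hyperparameter."""
    return speaker_loss(frames, weights, speaker_id) + dc_weight * dc_loss(frames, centroids)

# Toy usage: three 2-dimensional frames, two clusters, two speakers.
frames = [[0.0, 1.0], [0.2, 0.8], [1.0, 0.0]]
centroids = [[0.0, 1.0], [1.0, 0.0]]
weights = [[1.0, 0.0], [0.0, 1.0]]
loss = joint_loss(frames, centroids, weights, speaker_id=1)
```

In a real system the frames would be network activations (or features from a pre-trained model such as wav2vec 2.0, HuBERT or WavLM), and both the centroids and the classifier weights would be trained by backpropagation; the sketch only shows how a frame-level term and a segment-level term can share one scalar objective.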