Speaker Recognition Based on Pre-Trained Model and Deep Clustering

Bibliographic Details
Published in	Proceedings (IEEE International Conference on Multimedia and Expo) pp. 1 - 6
Main Authors	He, Liang; Song, Zhida; Liu, Shuanghong; Niu, Mengqi; Hu, Ying; Huang, Hao
Format	Conference Proceeding
Language	English
Published	IEEE 15.07.2024
Subjects	Computational modeling; Costs; deep clustering loss; Feature extraction; multi-task learning; Multimedia databases; Phonetics; pseudo-phoneme labels; self-supervised learning; speaker recognition; System performance; Training
ISSN	1945-788X
DOI	10.1109/ICME57554.2024.10687367

Abstract	In this paper, we propose a novel loss that integrates a deep clustering (DC) loss at the frame level and a speaker recognition loss at the segment level into a single network, without requiring additional data or exhaustive computation. The DC loss implicitly generates soft pseudo-phoneme labels for each frame-level feature, which helps extract more discriminative speaker representations by suppressing phonetic content information. We study the DC loss not only on acoustic features but also on features extracted by pre-trained models such as wav2vec 2.0, HuBERT, and WavLM. Experimental results on the VoxCeleb dataset show that overall system performance with pre-trained model features is better than with acoustic features. The proposed loss is significantly effective for systems built on acoustic features and yields a marginal improvement for systems built on pre-trained model features.
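The abstract describes a single objective that combines a frame-level clustering term with a segment-level speaker term. This record does not give the paper's actual formulation, so the following is only a minimal sketch under stated assumptions: the clustering term is a soft k-means objective used as a stand-in for the DC loss, the speaker term is plain cross-entropy, and all names (`soft_pseudo_labels`, `frame_level_dc_loss`, `segment_level_speaker_loss`, `combined_loss`, `dc_weight`) are hypothetical.

```python
import math


def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def sq_dist(a, b):
    # Squared Euclidean distance between two equal-length vectors.
    return sum((x - y) ** 2 for x, y in zip(a, b))


def soft_pseudo_labels(frame, centroids):
    # Assumption: a frame's soft label is a softmax over negative squared
    # distances to the cluster centroids (closer centroid, higher weight).
    return softmax([-sq_dist(frame, c) for c in centroids])


def frame_level_dc_loss(frames, centroids):
    # Stand-in clustering loss (soft k-means), NOT the paper's DC loss:
    # the assignment-weighted squared distance, averaged over frames.
    total = 0.0
    for frame in frames:
        q = soft_pseudo_labels(frame, centroids)
        total += sum(qk * sq_dist(frame, c) for qk, c in zip(q, centroids))
    return total / len(frames)


def segment_level_speaker_loss(speaker_logits, speaker_id):
    # Cross-entropy of the true speaker given segment-level logits.
    return -math.log(softmax(speaker_logits)[speaker_id])


def combined_loss(frames, centroids, speaker_logits, speaker_id, dc_weight=0.1):
    # Multi-task objective: speaker term plus a weighted clustering term.
    # dc_weight is a hypothetical hyperparameter, not a value from the paper.
    spk = segment_level_speaker_loss(speaker_logits, speaker_id)
    return spk + dc_weight * frame_level_dc_loss(frames, centroids)


# Toy usage: two 2-D frames, two centroids, two candidate speakers.
example_frames = [[0.0, 0.0], [1.0, 1.0]]
example_centroids = [[0.0, 0.0], [1.0, 1.0]]
print(combined_loss(example_frames, example_centroids, [2.0, 0.0], 0))
```

In a real system both terms would be computed on network outputs and minimized jointly by gradient descent; the sketch only illustrates how a frame-level and a segment-level term might be summed into one scalar.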
Author_xml	– sequence: 1; givenname: Liang; surname: He; fullname: He, Liang; organization: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
	– sequence: 2; givenname: Zhida; surname: Song; fullname: Song, Zhida; organization: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
	– sequence: 3; givenname: Shuanghong; surname: Liu; fullname: Liu, Shuanghong; organization: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
	– sequence: 4; givenname: Mengqi; surname: Niu; fullname: Niu, Mengqi; organization: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
	– sequence: 5; givenname: Ying; surname: Hu; fullname: Hu, Ying; organization: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
	– sequence: 6; givenname: Hao; surname: Huang; fullname: Huang, Hao; organization: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
Discipline	Computer Science
EISBN	9798350390155
EISSN	1945-788X
EndPage	6
ExternalDocumentID	10687367
Genre	orig-research
GrantInformation_xml	– fundername: National Natural Science Foundation of China funderid: 10.13039/501100001809
IsPeerReviewed	false
IsScholarly	false
Language	English
PageCount	6
PublicationDateYYYYMMDD	2024-07-15
PublicationTitle	Proceedings (IEEE International Conference on Multimedia and Expo)
PublicationTitleAbbrev	ICME
StartPage	1
URI	https://ieeexplore.ieee.org/document/10687367