Speaker Recognition Based on Pre-Trained Model and Deep Clustering

Bibliographic Details
Published in Proceedings (IEEE International Conference on Multimedia and Expo), pp. 1-6
Main Authors He, Liang; Song, Zhida; Liu, Shuanghong; Niu, Mengqi; Hu, Ying; Huang, Hao
Format Conference Proceeding
Language English
Published IEEE 15.07.2024
ISSN 1945-788X
DOI 10.1109/ICME57554.2024.10687367

Abstract In this paper, we propose a novel loss that integrates a deep clustering (DC) loss at the frame level and a speaker recognition loss at the segment level into a single network, without additional data requirements or exhaustive computation. The DC loss implicitly generates soft pseudo-phoneme labels for each frame-level feature, which facilitates extracting more discriminative speaker representations by suppressing phonetic content information. We study the DC loss not only on acoustic features but also on features extracted by pre-trained models such as wav2vec 2.0, HuBERT, and WavLM. Experimental results on the VoxCeleb dataset show that overall system performance based on pre-trained model features is better than that based on acoustic features. The proposed loss is significantly effective for systems built on acoustic features and yields a marginal improvement for systems built on pre-trained model features.
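The abstract describes a joint objective: a frame-level clustering term plus a segment-level speaker classification term. The sketch below illustrates that general shape only. The abstract does not give the exact form of the DC loss, so the soft k-means style term, the `dc_weight` factor, the `temperature` parameter, and every function name here are assumptions made for illustration, not the authors' formulation.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def soft_assignments(frame, centroids, temperature=1.0):
    """Soft pseudo-label for one frame: softmax over negative squared
    distances to each cluster centroid (assumed form)."""
    scores = [-squared_distance(frame, c) / temperature for c in centroids]
    return softmax(scores)

def dc_loss(frames, centroids):
    """Illustrative frame-level clustering loss: expected squared distance
    to the centroids under the soft assignments, averaged over frames."""
    total = 0.0
    for frame in frames:
        probs = soft_assignments(frame, centroids)
        dists = [squared_distance(frame, c) for c in centroids]
        total += sum(p * d for p, d in zip(probs, dists))
    return total / len(frames)

def speaker_loss(logits, speaker_id):
    """Segment-level cross-entropy over speaker logits."""
    return -math.log(softmax(logits)[speaker_id])

def joint_loss(frames, centroids, logits, speaker_id, dc_weight=0.1):
    """Weighted sum of the segment-level and frame-level terms
    (the weighting scheme is an assumption)."""
    return speaker_loss(logits, speaker_id) + dc_weight * dc_loss(frames, centroids)

# Toy inputs: three 2-D frame features, two centroids, three speakers.
frames = [[0.0, 0.1], [1.0, 0.9], [0.1, 0.0]]
centroids = [[0.0, 0.0], [1.0, 1.0]]
logits = [2.0, 0.5, -1.0]
print(joint_loss(frames, centroids, logits, speaker_id=0))
```

In a trained system the frame features, centroids, and logits would all come from the network and be updated by backpropagation; here they are fixed toy values so that the arithmetic of the combined loss is easy to follow.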
Author Affiliations
1. He, Liang: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
2. Song, Zhida: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
3. Liu, Shuanghong: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
4. Niu, Mengqi: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
5. Hu, Ying: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
6. Huang, Hao: Xinjiang University, School of Computer Science and Technology, Urumqi, China, 830017
Discipline Computer Science
EISBN 9798350390155
EISSN 1945-788X
EndPage 6
ExternalDocumentID 10687367
Genre orig-research
Funding National Natural Science Foundation of China (funder ID 10.13039/501100001809)
Subjects Computational modeling; Costs; deep clustering loss; Feature extraction; multi-task learning; Multimedia databases; Phonetics; pseudo-phoneme labels; self-supervised learning; speaker recognition; System performance; Training
URI https://ieeexplore.ieee.org/document/10687367