Speaker Recognition Based on Pre-Trained Model and Deep Clustering
Published in | Proceedings (IEEE International Conference on Multimedia and Expo), pp. 1-6 |
---|---|
Format | Conference Proceeding |
Language | English |
Publisher | IEEE |
Published | 15.07.2024 |
ISSN | 1945-788X |
DOI | 10.1109/ICME57554.2024.10687367 |
Summary: | In this paper, we propose a novel loss that integrates a deep clustering (DC) loss at the frame level and a speaker recognition loss at the segment level into a single network, without additional data requirements or exhaustive computation. The DC loss implicitly generates soft pseudo-phoneme labels for each frame-level feature, which facilitates extracting a more discriminative speaker representation by suppressing phonetic content information. We study the DC loss not only on the acoustic feature but also on features extracted by pre-trained models such as wav2vec 2.0, HuBERT, and WavLM. Experimental results on the VoxCeleb dataset show that overall system performance based on pre-trained model features is better than that based on the acoustic feature. The proposed loss is significantly effective for systems on the acoustic feature and yields a marginal improvement for systems on pre-trained model features. |
---|---|
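The summary describes combining a frame-level deep clustering loss with a segment-level speaker loss in one network. The record gives no implementation details, so the sketch below is only an illustration under stated assumptions: it uses the classic deep-clustering objective of Hershey et al. (‖VVᵀ − YYᵀ‖²_F over frame embeddings V and soft cluster assignments Y) as a stand-in for the paper's DC loss, a plain softmax cross-entropy for the speaker loss, and a hypothetical trade-off weight `lam` that the paper may choose differently:

```python
import numpy as np

def dc_loss(V, Y):
    """Deep-clustering-style objective ||V V^T - Y Y^T||_F^2.

    V: (T, d) frame-level embeddings (rows are L2-normalized here).
    Y: (T, K) soft pseudo-phoneme assignments per frame.
    NOTE: assumed form; the paper's exact DC loss is not given in this record.
    """
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    A = V @ V.T  # affinity implied by the embeddings
    B = Y @ Y.T  # affinity implied by the soft assignments
    return np.sum((A - B) ** 2)

def speaker_ce(logits, label):
    """Segment-level softmax cross-entropy on pooled speaker logits."""
    z = logits - logits.max()            # stabilize the softmax
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def combined_loss(V, Y, logits, label, lam=0.1):
    """Single training objective: speaker loss + weighted frame-level DC loss.

    `lam` is a hypothetical weight; the actual weighting or schedule used
    in the paper is not specified in this record.
    """
    return speaker_ce(logits, label) + lam * dc_loss(V, Y)
```

In this form, the DC term pushes frames sharing a pseudo-phoneme toward the same embedding direction, which is one plausible reading of "suppressing phonetic content information" at the frame level while the speaker loss shapes the segment-level representation.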