Speaker Recognition Based on Pre-Trained Model and Deep Clustering
Published in | Proceedings (IEEE International Conference on Multimedia and Expo), pp. 1-6 |
---|---|
Format | Conference Proceeding |
Language | English |
Publisher | IEEE |
Published | 15.07.2024 |
ISSN | 1945-788X |
DOI | 10.1109/ICME57554.2024.10687367 |
Summary: | In this paper, we propose a novel loss that integrates a deep clustering (DC) loss at the frame level and a speaker recognition loss at the segment level into a single network, without additional data requirements or exhaustive computation. The DC loss implicitly generates soft pseudo-phoneme labels for each frame-level feature, which facilitates extracting a more discriminative speaker representation by suppressing phonetic content information. We study the DC loss not only on the acoustic feature but also on features extracted by pre-trained models such as wav2vec 2.0, HuBERT, and WavLM. Experimental results on the VoxCeleb dataset show that overall system performance based on pre-trained model features is better than that based on the acoustic feature. The proposed loss is significantly effective for systems on the acoustic feature and yields a marginal improvement for systems on pre-trained model features. |
---|---|
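The summary describes combining a frame-level deep clustering loss with a segment-level speaker loss in one network. The record gives no implementation details, so the sketch below is only an illustration under stated assumptions: it uses the classic deep-clustering objective of Hershey et al. (‖VVᵀ − YYᵀ‖²_F over frame embeddings V and soft cluster assignments Y) as a stand-in for the paper's DC loss, a plain softmax cross-entropy for the speaker loss, and a hypothetical trade-off weight `lam` that the paper may choose differently:

```python
import numpy as np

def dc_loss(V, Y):
    """Deep-clustering-style objective ||V V^T - Y Y^T||_F^2.

    V: (T, d) frame-level embeddings (rows are L2-normalized here).
    Y: (T, K) soft pseudo-phoneme assignments per frame.
    NOTE: assumed form; the paper's exact DC loss is not given in this record.
    """
    V = V / np.linalg.norm(V, axis=1, keepdims=True)
    A = V @ V.T  # affinity implied by the embeddings
    B = Y @ Y.T  # affinity implied by the soft assignments
    return np.sum((A - B) ** 2)

def speaker_ce(logits, label):
    """Segment-level softmax cross-entropy on pooled speaker logits."""
    z = logits - logits.max()            # stabilize the softmax
    logp = z - np.log(np.exp(z).sum())
    return -logp[label]

def combined_loss(V, Y, logits, label, lam=0.1):
    """Single training objective: speaker loss + weighted frame-level DC loss.

    `lam` is a hypothetical weight; the actual weighting or schedule used
    in the paper is not specified in this record.
    """
    return speaker_ce(logits, label) + lam * dc_loss(V, Y)
```

In this form, the DC term pushes frames sharing a pseudo-phoneme toward the same embedding direction, which is one plausible reading of "suppressing phonetic content information" at the frame level while the speaker loss shapes the segment-level representation.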