SA-WavLM: Speaker-Aware Self-Supervised Pre-training for Mixture Speech

Bibliographic Details
Published in	arXiv.org
Main Authors	Lin, Jingru; Ge, Meng; Ao, Junyi; Deng, Liqun; Li, Haizhou
Format	Paper
Language	English
Published	Ithaca: Cornell University Library, arXiv.org, 03.07.2024
Subjects	Mixtures; Self-supervised learning; Speech

Summary:	Pre-trained models based on self-supervised learning (SSL) have been shown to be effective in various downstream speech tasks. However, most such models are trained on single-speaker speech data, which limits their effectiveness on mixture speech. This motivates us to explore pre-training on mixture speech. This work presents SA-WavLM, a novel pre-trained model for mixture speech. Specifically, SA-WavLM follows an "extract-merge-predict" pipeline in which the representations of each speaker in the input mixture are first extracted individually and then merged before the final prediction. In this pipeline, SA-WavLM performs speaker-informed extraction that accounts for the interactions between different speakers. Furthermore, a speaker shuffling strategy is proposed to improve robustness to speaker absence. Experiments show that SA-WavLM either matches or improves upon state-of-the-art pre-trained models.
ISSN:	2331-8422