Selective HuBERT: Self-Supervised Pre-Training for Target Speaker in Clean and Mixture Speech
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 08.11.2023 |
Summary: | Self-supervised pre-trained speech models have been shown to be effective for various downstream speech processing tasks. Since they are mainly pre-trained to map input speech to pseudo-labels, the resulting representations are only effective for the type of pre-training data used, either clean or mixture speech. Inspired by selective auditory attention, we propose a novel pre-training solution called Selective-HuBERT, or SHuBERT, which learns to selectively extract target speech representations from either clean or mixture speech. Specifically, SHuBERT is trained to predict the pseudo-labels of a target speaker, conditioned on enrolled speech from that speaker. By doing so, SHuBERT is expected to selectively attend to the target speaker in a complex acoustic environment, thus benefiting various downstream tasks. We further introduce a dual-path training strategy and use a cross-correlation constraint between the two branches to encourage the model to generate noise-invariant representations. Experiments on the SUPERB benchmark and the LibriMix dataset demonstrate the universality and noise robustness of SHuBERT. Furthermore, we find that our high-quality representations can be easily integrated with conventional supervised learning methods to achieve strong performance even with extremely limited labeled data. |
---|---|
DOI: | 10.48550/arxiv.2311.04526 |
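The abstract mentions a cross-correlation constraint between the two training branches to encourage noise-invariant representations, but does not spell out the loss. Below is a minimal sketch, assuming a Barlow-Twins-style objective in which the normalized cross-correlation matrix between the clean-branch and mixture-branch representations is pushed toward the identity; the function name, pooling to utterance level, and the `lambda_offdiag` weight are illustrative assumptions, not details from the paper.

```python
# Hypothetical sketch of a cross-correlation constraint between two branches
# (e.g., a branch fed clean speech and a branch fed the corresponding mixture).
# Assumed Barlow-Twins-style formulation; not the paper's exact loss.
import torch


def cross_correlation_loss(z_clean: torch.Tensor,
                           z_mix: torch.Tensor,
                           lambda_offdiag: float = 5e-3) -> torch.Tensor:
    """z_clean, z_mix: (batch, dim) utterance-level representations."""
    # Standardize each feature dimension across the batch.
    z1 = (z_clean - z_clean.mean(0)) / (z_clean.std(0) + 1e-6)
    z2 = (z_mix - z_mix.mean(0)) / (z_mix.std(0) + 1e-6)

    n, _ = z1.shape
    # Empirical cross-correlation matrix between the two branches: (dim, dim).
    c = (z1.T @ z2) / n

    # Diagonal entries should be 1 (branches agree per feature);
    # off-diagonal entries should be 0 (features decorrelated).
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()
    return on_diag + lambda_offdiag * off_diag


# Usage sketch with random stand-in embeddings.
z_clean = torch.randn(8, 256)   # representations from the clean-input branch
z_mix = torch.randn(8, 256)     # representations from the mixture-input branch
loss = cross_correlation_loss(z_clean, z_mix)
```

Under these assumptions, the diagonal term pulls the two branches toward the same representation for a given input pair, while the off-diagonal term discourages redundant features; how SHuBERT combines this with its pseudo-label prediction loss is described in the full paper.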