Pre-Finetuning for Few-Shot Emotional Speech Recognition
Format: Journal Article
Language: English
Published: 24.02.2023
Summary: Speech models have long been known to overfit individual speakers for many classification tasks. This leads to poor generalization in settings where the speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and propose investigating transfer learning approaches inspired by recent success with pre-trained models in natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate our pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
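The summary states that Wav2Vec2.0 is pre-finetuned on "every permutation of four multiclass emotional speech recognition corpora." A minimal sketch of that enumeration, assuming this means every ordering of the full set of four corpora (the record does not name the corpora, so the names below are placeholders, and the paper could alternatively mean subsets rather than orderings):

```python
from itertools import permutations

# Placeholder names; the actual four corpora are not listed in this record.
corpora = ["corpus_a", "corpus_b", "corpus_c", "corpus_d"]

# One hypothetical pre-finetuning run per ordering of the four corpora.
orderings = list(permutations(corpora))
print(len(orderings))  # 24 possible orderings of 4 corpora
```

Under this reading, each ordering defines one sequential pre-finetuning curriculum, and each resulting checkpoint is then evaluated via few-shot fine-tuning trials on the Emotional Speech Dataset.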
DOI: 10.48550/arxiv.2302.12921