Pre-Finetuning for Few-Shot Emotional Speech Recognition


Bibliographic Details
Main Authors: Chen, Maximillian; Yu, Zhou
Format: Journal Article
Language: English
Published: 24.02.2023

Summary: Speech models have long been known to overfit individual speakers for many classification tasks. This leads to poor generalization in settings where the speakers are out-of-domain or out-of-distribution, as is common in production environments. We view speaker adaptation as a few-shot learning problem and propose investigating transfer learning approaches inspired by recent success with pre-trained models in natural language tasks. We propose pre-finetuning speech models on difficult tasks to distill knowledge into few-shot downstream classification objectives. We pre-finetune Wav2Vec2.0 on every permutation of four multiclass emotional speech recognition corpora and evaluate our pre-finetuned models through 33,600 few-shot fine-tuning trials on the Emotional Speech Dataset.
DOI: 10.48550/arxiv.2302.12921
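
The abstract describes a two-stage recipe: pre-finetune Wav2Vec2.0 as a multiclass emotion classifier on pooled corpora, then fine-tune the resulting encoder on a few-shot downstream emotion classification task. The sketch below shows one way such a pipeline could look with the Hugging Face transformers API; the checkpoint name, label counts, learning rates, and the local directory "wav2vec2-pre-finetuned" are illustrative assumptions, not the paper's reported configuration.

```python
# Hedged sketch of the two-stage recipe described in the abstract.
# Checkpoint names, label counts, and hyperparameters are assumptions.
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

# Stage 1: pre-finetune Wav2Vec2.0 as a multiclass emotion classifier on the
# pooled pre-finetuning corpora (assumed 6 emotion labels here).
pre_model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-base", num_labels=6
)
pre_optimizer = torch.optim.AdamW(pre_model.parameters(), lr=3e-5)

def train_step(model, optimizer, waveforms, labels):
    """One supervised step on raw 16 kHz waveforms (list of 1-D float arrays)."""
    inputs = feature_extractor(
        waveforms, sampling_rate=16_000, return_tensors="pt", padding=True
    )
    outputs = model(**inputs, labels=labels)
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()

# ... iterate train_step over the pre-finetuning corpora, then keep the encoder:
pre_model.save_pretrained("wav2vec2-pre-finetuned")

# Stage 2: few-shot fine-tuning on the downstream dataset (the Emotional Speech
# Dataset has 5 emotion classes), reusing the pre-finetuned encoder with a
# fresh classification head.
few_shot_model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "wav2vec2-pre-finetuned", num_labels=5, ignore_mismatched_sizes=True
)
few_shot_optimizer = torch.optim.AdamW(few_shot_model.parameters(), lr=1e-5)
# ... iterate train_step over the handful of labelled downstream examples.
```

The paper evaluates this split over every permutation of the four pre-finetuning corpora and 33,600 few-shot trials; the sketch only outlines the skeleton of a single run.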