Exploring Wav2vec 2.0 Fine Tuning for Improved Speech Emotion Recognition

Bibliographic Details
Published in	ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 1 - 5
Main Authors	Chen, Li-Wei; Rudnicky, Alexander
Format	Conference Proceeding
Language	English
Published	IEEE 04.06.2023
Subjects	Codes; deep neural networks; Emotion recognition; fine-tuning; pretrained models; Signal processing; Signal processing algorithms; Speech coding; Speech emotion recognition; Speech recognition; Task analysis; wav2vec 2.0
Summary:	While Wav2Vec 2.0 has been proposed for speech recognition (ASR), it can also be used for speech emotion recognition (SER); its performance can be significantly improved using different fine-tuning strategies. Two baseline methods, vanilla fine-tuning (V-FT) and task adaptive pretraining (TAPT), are first presented. We show that V-FT is able to outperform state-of-the-art models on the IEMOCAP dataset. TAPT, an existing NLP fine-tuning strategy, further improves the performance on SER. We also introduce a novel fine-tuning method termed P-TAPT, which modifies the TAPT objective to learn contextualized emotion representations. Experiments show that P-TAPT performs better than TAPT, especially under low-resource settings. Compared to prior works in this literature, our top-line system achieved a 7.4% absolute improvement in unweighted accuracy (UA) over the state-of-the-art performance on IEMOCAP. Our code is publicly available.
ISSN:	2379-190X
DOI:	10.1109/ICASSP49357.2023.10095036
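
Note on the metric: the summary reports results in unweighted accuracy (UA). In the speech emotion recognition literature, UA is commonly computed as the mean of per-class recalls, so every emotion class counts equally regardless of how many utterances it has. The sketch below illustrates that common definition on made-up labels; the function names and toy data are assumptions for illustration only and are not taken from the paper or its released code.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls: each class contributes equally."""
    total = defaultdict(int)    # utterances per true class
    correct = defaultdict(int)  # correctly predicted utterances per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


def weighted_accuracy(y_true, y_pred):
    """Plain accuracy: fraction of all utterances predicted correctly."""
    hits = sum(t == p for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Hypothetical, imbalanced toy labels (not real IEMOCAP data).
y_true = ["neu", "neu", "neu", "neu", "ang", "sad"]
y_pred = ["neu", "neu", "neu", "neu", "neu", "sad"]

ua = unweighted_accuracy(y_true, y_pred)
wa = weighted_accuracy(y_true, y_pred)
print(f"UA={ua:.3f} WA={wa:.3f}")
```

On imbalanced data such as this toy set, UA falls below plain accuracy because the single missed "ang" utterance drags one class recall to zero.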