Jointly Predicting Emotion, Age, and Country Using Pre-Trained Acoustic Embedding

Bibliographic Details
Published in: 2022 10th International Conference on Affective Computing and Intelligent Interaction Workshops and Demos (ACIIW), pp. 1-6
Main Authors: Atmaja, Bagus Tris; Zanjabila; Sasou, Akira
Format: Conference Proceeding
Language: English
Published: IEEE, 18.10.2022

Summary: In this paper, we demonstrated the benefit of using a pre-trained model to extract acoustic embeddings for jointly predicting (multitask learning) three tasks: emotion, age, and native country. The pre-trained model is based on the wav2vec 2.0 large-robust model trained on a speech emotion corpus. The emotion and age tasks were regression problems, while country prediction was a classification task. A single harmonic mean of three metrics was used to evaluate the performance of multitask learning. The classifier was a linear network with shared layers and two independent layers connected to the output layers. This study explores multitask learning on different acoustic features (including the acoustic embedding extracted from a model trained on an affective speech dataset), seed numbers, batch sizes, and waveform normalizations for predicting paralinguistic information from speech.
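The summary describes the model and metric only at a high level. The following is a minimal PyTorch sketch of such a setup, not the authors' implementation: the embedding dimension (1024), hidden size, number of emotion targets, number of countries, and the names MultitaskHead and harmonic_mean are assumptions made for illustration.

    import torch
    import torch.nn as nn

    class MultitaskHead(nn.Module):
        """Shared layers over a pre-trained acoustic embedding, followed by
        task-specific output layers: emotion and age regression, country classification."""
        def __init__(self, embed_dim=1024, hidden_dim=256, n_emotions=1, n_countries=4):
            super().__init__()
            # Shared layers applied to the extracted acoustic embedding
            self.shared = nn.Sequential(nn.Linear(embed_dim, hidden_dim), nn.ReLU())
            # Independent task-specific output layers
            self.emotion_head = nn.Linear(hidden_dim, n_emotions)   # regression output(s)
            self.age_head = nn.Linear(hidden_dim, 1)                 # regression output
            self.country_head = nn.Linear(hidden_dim, n_countries)   # classification logits

        def forward(self, embedding):
            h = self.shared(embedding)
            return self.emotion_head(h), self.age_head(h), self.country_head(h)

    def harmonic_mean(m1, m2, m3, eps=1e-8):
        """Single harmonic mean of three per-task metrics, all assumed to lie in (0, 1]."""
        return 3.0 / (1.0 / (m1 + eps) + 1.0 / (m2 + eps) + 1.0 / (m3 + eps))

    # Example: a batch of 8 pre-extracted embeddings (1024-dim here is an assumption)
    model = MultitaskHead()
    emotion, age, country_logits = model(torch.randn(8, 1024))

In practice, the three task losses would be combined during training, and the harmonic mean would be computed from held-out per-task scores (for example, a correlation-based score for the two regression tasks and a recall- or accuracy-based score for country classification).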
DOI: 10.1109/ACIIW57231.2022.10085991