Personalized Speech Emotion Recognition in Human-Robot Interaction Using Vision Transformers
Published in: IEEE Robotics and Automation Letters, Vol. 10, No. 5, pp. 4890-4897
Format: Journal Article
Language: English
Published: IEEE, 01.05.2025
Summary: Emotions are an essential element of human verbal communication; it is therefore important to understand individuals' affect during human-robot interaction (HRI). This letter investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (Bidirectional Encoder Representations from Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models to individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from several human subjects having pseudo-naturalistic conversations with the NAO social robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants in order to identify four primary emotions from speech: neutral, happy, sad, and angry. The results show that fine-tuning vision transformers on benchmark datasets and then using either these already fine-tuned models or ensembling ViT/BEiT models yields higher classification accuracies than fine-tuning vanilla ViTs or BEiTs.
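The pipeline the abstract describes (fine-tuning pretrained ViT/BEiT image models for four-class SER, then ensembling them) can be illustrated with a short sketch. This is an assumption-laden illustration, not the authors' implementation: it presumes speech clips have already been rendered as fixed-size spectrogram images (the usual way to feed audio into vision transformers), and the Hugging Face checkpoint names, the `train_loader` yielding `(pixel_values, labels)` batches, and all hyperparameters are hypothetical placeholders.

```python
# Sketch: fine-tune pretrained ViT and BEiT on spectrogram "images" for
# 4-class speech emotion recognition, then soft-vote the two models.
# Checkpoints, loader, and hyperparameters are illustrative assumptions.
import torch
import torch.nn.functional as F
from transformers import BeitForImageClassification, ViTForImageClassification

LABELS = ["neutral", "happy", "sad", "angry"]

vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224-in21k", num_labels=len(LABELS)
)
beit = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224-pt22k-ft22k",
    num_labels=len(LABELS),
    ignore_mismatched_sizes=True,  # swap the 22k-class head for a 4-class head
)

def fine_tune(model, train_loader, epochs=3, lr=5e-5):
    """Standard supervised fine-tuning; train_loader yields (pixel_values, labels)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for _ in range(epochs):
        for pixel_values, labels in train_loader:
            out = model(pixel_values=pixel_values, labels=labels)
            out.loss.backward()
            optimizer.step()
            optimizer.zero_grad()

@torch.no_grad()
def ensemble_predict(models, pixel_values):
    """Soft voting: average softmax probabilities across models, then argmax."""
    probs = torch.stack(
        [F.softmax(m(pixel_values=pixel_values).logits, dim=-1) for m in models]
    ).mean(dim=0)
    return [LABELS[i] for i in probs.argmax(dim=-1).tolist()]
```

Soft voting is only one plausible ensembling scheme; majority voting over per-model argmax predictions would be an equally simple alternative, and the letter itself does not specify which variant is used.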
ISSN: 2377-3766
DOI: 10.1109/LRA.2025.3554949