Personalized Speech Emotion Recognition in Human-Robot Interaction Using Vision Transformers

Bibliographic Details
Published in: IEEE Robotics and Automation Letters, Vol. 10, no. 5, pp. 4890-4897
Main Authors: Mishra, Ruchik; Frye, Andrew; Rayguru, Madan M.; Popa, Dan O.
Format: Journal Article
Language: English
Published: IEEE, 01.05.2025

More Information
Summary: Emotions are an essential element in human verbal communication; it is therefore important to understand individuals' affect during human-robot interaction (HRI). This letter investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (Bidirectional Encoder Representations from Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models to individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from several human subjects having pseudo-naturalistic conversations with the NAO social robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants in order to identify four primary emotions from speech: neutral, happy, sad, and angry. The results show that fine-tuning vision transformers on benchmark datasets, and then using these already fine-tuned models either individually or in ViT/BEiT ensembles, yields higher classification accuracies than fine-tuning vanilla ViTs or BEiTs.
ISSN: 2377-3766
DOI: 10.1109/LRA.2025.3554949
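
The summary gives no implementation details, but the pipeline it describes (speech clip, spectrogram "image", fine-tuned ViT/BEiT, ensemble) can be sketched roughly as below. The checkpoint names, the log-mel preprocessing, and probability averaging for the ensemble are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of the pipeline described in the summary: speech clip ->
# log-mel spectrogram "image" -> fine-tuned ViT/BEiT -> averaged ensemble.
# Checkpoint names, spectrogram settings, and probability averaging are
# illustrative assumptions, not the paper's exact method.
import numpy as np
import torch
import librosa
from transformers import ViTForImageClassification, BeitForImageClassification

LABELS = ["neutral", "happy", "sad", "angry"]  # the four emotions in the abstract

def wav_to_pixel_values(path: str, sr: int = 16000) -> torch.Tensor:
    """Turn a speech clip into a 3-channel 224x224 log-mel tensor."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=224)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Min-max scale to [0, 1]; a production pipeline would instead apply the
    # checkpoint's own image-processor normalization.
    img = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    img = torch.tensor(img, dtype=torch.float32)[None, None]      # (1, 1, 224, T)
    img = torch.nn.functional.interpolate(img, size=(224, 224),
                                          mode="bilinear", align_corners=False)
    return img.repeat(1, 3, 1, 1)                                 # (1, 3, 224, 224)

# Swap the ImageNet classification heads for 4-way emotion heads.
vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", num_labels=len(LABELS),
    ignore_mismatched_sizes=True)
beit = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224", num_labels=len(LABELS),
    ignore_mismatched_sizes=True)

# Fine-tuning would be a standard cross-entropy loop over labeled clips, e.g.:
#   opt = torch.optim.AdamW(vit.parameters(), lr=2e-5)
#   loss = torch.nn.functional.cross_entropy(vit(pixel_values=x).logits, y_true)
#   loss.backward(); opt.step(); opt.zero_grad()

@torch.no_grad()
def ensemble_predict(path: str) -> str:
    """Average the two models' softmax outputs (one simple ensembling choice)."""
    x = wav_to_pixel_values(path)
    probs = (vit(pixel_values=x).logits.softmax(-1)
             + beit(pixel_values=x).logits.softmax(-1)) / 2
    return LABELS[int(probs.argmax())]
```

Averaging softmax probabilities is only one way to combine the two models; the letter's ensemble strategy may differ (e.g., logit averaging or weighted voting), and the full text should be consulted for the actual design.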