Personalized Speech Emotion Recognition in Human-Robot Interaction Using Vision Transformers

Bibliographic Details
Published in: IEEE Robotics and Automation Letters, Vol. 10, no. 5, pp. 4890-4897
Main Authors: Mishra, Ruchik; Frye, Andrew; Rayguru, Madan M.; Popa, Dan O.
Format: Journal Article
Language: English
Published: IEEE, 01.05.2025

More Information
Summary: Emotions are an essential element in human verbal communication; it is therefore important to understand individuals' affect during human-robot interaction (HRI). This letter investigates the application of vision transformer models, namely ViT (Vision Transformers) and BEiT (Bidirectional Encoder Representations from Pre-Training of Image Transformers) pipelines, for Speech Emotion Recognition (SER) in HRI. The focus is to generalize the SER models to individual speech characteristics by fine-tuning these models on benchmark datasets and exploiting ensemble methods. For this purpose, we collected audio data from several human subjects having pseudo-naturalistic conversations with the NAO social robot. We then fine-tuned our ViT- and BEiT-based models and tested them on unseen speech samples from the participants in order to identify four primary emotions from speech: neutral, happy, sad, and angry. The results show that fine-tuning vision transformers on benchmark datasets, and then using these already fine-tuned models either individually or in ViT/BEiT ensembles, yields higher classification accuracies than fine-tuning vanilla ViTs or BEiTs.
ISSN: 2377-3766
DOI: 10.1109/LRA.2025.3554949
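
The summary gives no implementation details, but the pipeline it describes (speech clip, spectrogram "image", fine-tuned ViT/BEiT, ensemble) can be sketched roughly as below. The checkpoint names, the log-mel preprocessing, and probability averaging for the ensemble are illustrative assumptions, not the authors' exact configuration.

```python
# Hypothetical sketch of the pipeline described in the summary: speech clip ->
# log-mel spectrogram "image" -> fine-tuned ViT/BEiT -> averaged ensemble.
# Checkpoint names, spectrogram settings, and probability averaging are
# illustrative assumptions, not the paper's exact method.
import numpy as np
import torch
import librosa
from transformers import ViTForImageClassification, BeitForImageClassification

LABELS = ["neutral", "happy", "sad", "angry"]  # the four emotions in the abstract

def wav_to_pixel_values(path: str, sr: int = 16000) -> torch.Tensor:
    """Turn a speech clip into a 3-channel 224x224 log-mel tensor."""
    y, _ = librosa.load(path, sr=sr)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=224)
    log_mel = librosa.power_to_db(mel, ref=np.max)
    # Min-max scale to [0, 1]; a production pipeline would instead apply the
    # checkpoint's own image-processor normalization.
    img = (log_mel - log_mel.min()) / (log_mel.max() - log_mel.min() + 1e-8)
    img = torch.tensor(img, dtype=torch.float32)[None, None]      # (1, 1, 224, T)
    img = torch.nn.functional.interpolate(img, size=(224, 224),
                                          mode="bilinear", align_corners=False)
    return img.repeat(1, 3, 1, 1)                                 # (1, 3, 224, 224)

# Swap the ImageNet classification heads for 4-way emotion heads.
vit = ViTForImageClassification.from_pretrained(
    "google/vit-base-patch16-224", num_labels=len(LABELS),
    ignore_mismatched_sizes=True)
beit = BeitForImageClassification.from_pretrained(
    "microsoft/beit-base-patch16-224", num_labels=len(LABELS),
    ignore_mismatched_sizes=True)

# Fine-tuning would be a standard cross-entropy loop over labeled clips, e.g.:
#   opt = torch.optim.AdamW(vit.parameters(), lr=2e-5)
#   loss = torch.nn.functional.cross_entropy(vit(pixel_values=x).logits, y_true)
#   loss.backward(); opt.step(); opt.zero_grad()

@torch.no_grad()
def ensemble_predict(path: str) -> str:
    """Average the two models' softmax outputs (one simple ensembling choice)."""
    x = wav_to_pixel_values(path)
    probs = (vit(pixel_values=x).logits.softmax(-1)
             + beit(pixel_values=x).logits.softmax(-1)) / 2
    return LABELS[int(probs.argmax())]
```

Averaging softmax probabilities is only one way to combine the two models; the letter's ensemble strategy may differ (e.g., logit averaging or weighted voting), and the full text should be consulted for the actual design.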