STERM: A Multimodal Speech Emotion Recognition Model in Filipino Gaming Settings

Bibliographic Details
Published in: 2022 IEEE 14th International Conference on Humanoid, Nanotechnology, Information Technology, Communication and Control, Environment, and Management (HNICEM), pp. 1-6
Main Authors: Magno, Giorgio Armani G., Cuchapin, Lhuijee Jhulo V., Estrada, Jheanel E.
Format: Conference Proceeding
Language: English
Published: IEEE, 01.12.2022

Summary: Gaming is highly connected to emotion. Unfortunately, most game experience research has little or no connection to the emotion literature, which leaves emotion in games poorly understood. As technology and the understanding of emotion progress, the researchers take the opportunity to investigate the underlying emotions expressed while playing Valorant, one of the most popular online games today. Recognizing these emotions requires a speech emotion recognition model. For emotion recognition in human speech, one can either extract emotion-related attributes directly from the audio or transcribe the speech into its text equivalent before analyzing it with natural language processing. Emotion detection further benefits from an audio-textual multimodal set-up, but devising a single system that learns jointly from both modalities is difficult; an alternative is to construct independent models for the two input sources and aggregate their outputs at the decision level. Inspired by this idea, the researchers propose a speech emotion recognition model that uses two modalities: speech and text. This study aims to evaluate multimodal emotion recognition on a natural speech database of in-game audio communications of Filipino gamers, and to detect the profane words uttered, using both audio and textual features. Employing deep learning algorithms, namely Convolutional Neural Networks (CNN) for speech and natural language processing for recognizing emotions from text and detecting profane words, the results were evaluated with standard statistical measures and then combined at the decision level, showing that the proposed approach achieves state-of-the-art performance on the natural speech database.
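
The decision-level fusion described in the abstract can be illustrated with a minimal Python sketch. Everything below is an assumption for illustration only: the emotion label set, the fusion weight, the function names, and the profanity lexicon are hypothetical, and neither the paper's trained models nor its Filipino word list is reproduced here.

import numpy as np

# Hypothetical emotion label set; the abstract does not enumerate the classes.
EMOTIONS = ["anger", "happiness", "sadness", "neutral"]

# Example profanity lexicon; the paper's actual Filipino word list is not given.
PROFANITY = {"gago", "tanga"}

def late_fusion(speech_probs, text_probs, speech_weight=0.5):
    # Decision-level (late) fusion: weighted average of the per-class
    # probabilities from two independently trained models (speech CNN
    # and text NLP model), then pick the highest-scoring emotion.
    fused = (speech_weight * np.asarray(speech_probs)
             + (1.0 - speech_weight) * np.asarray(text_probs))
    return EMOTIONS[int(np.argmax(fused))], fused

def detect_profanity(transcript):
    # Simple lexicon lookup on the speech-to-text transcript.
    return sorted(set(transcript.lower().split()) & PROFANITY)

# Toy per-class probabilities standing in for the two models' outputs.
speech_out = [0.60, 0.15, 0.10, 0.15]  # CNN on acoustic features
text_out = [0.30, 0.40, 0.10, 0.20]    # NLP model on the transcript

label, fused = late_fusion(speech_out, text_out)
print(label)                               # anger
print(detect_profanity("gago ka talaga"))  # ['gago']

One appeal of late fusion, and a plausible reason the abstract favors it over a jointly trained multimodal network, is that each modality's model can be built, trained, and evaluated independently, with only their output probabilities combined at the end.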
ISSN: 2770-0682
DOI: 10.1109/HNICEM57413.2022.10109472