A Review of multi-modal speech emotion recognition and various techniques used to solve emotion recognition on speech data


Bibliographic Details
Published in 2023 5th International Conference on Inventive Research in Computing Applications (ICIRCA) pp. 577-582
Main Authors Nanduri, Venkata Naga Pavani Sai Suchitra; Sagiri, Chinmai; Manasa, S Satya Siva; Sanvithatesh, Raavi; M, Ashwin
Format Conference Proceeding
Language English
Published IEEE 03.08.2023
Summary: Speech emotion recognition (SER) is an emerging discipline in artificial intelligence. By helping us understand human emotions and behavior, SER technologies can support better decisions in a variety of industries, including healthcare, education, and marketing. Despite its challenges, the task is crucial for Human-Computer Interaction (HCI). This study creates a robust and trustworthy emotion recognition system that can be employed in practical contexts as technology and the scientific understanding of emotions evolve. Speech emotion recognition is used in many fields, though it has limitations such as inter-subject variability, noise, ambient influences, over-fitting, and a lack of annotated data. The multi-modal format provides three input modalities: audio, text, and video. To give readers an up-to-date understanding of this popular field of research, the study rigorously identifies and synthesizes recent literature on the design components and methodologies of SER systems. The proposed model encodes information from audio, video, and text, and uses the encoded data to predict the sentiment of the retrieved text as happy, sad, angry, or neutral. The method helps to minimize computational complexity while attaining competitive results on all tasks.
DOI: 10.1109/ICIRCA57980.2023.10220691