Multimodal and Temporal Perception of Audio-visual Cues for Emotion Recognition

Bibliographic Details
Published in	International Conference on Affective Computing and Intelligent Interaction and Workshops, pp. 552-558
Main Authors	Ghaleb, Esam; Popa, Mirela; Asteriadis, Stylianos
Format	Conference Proceeding
Language	English
Published	IEEE 01.09.2019
Subjects	Affective computing; audio-video emotion recognition; deep metric learning; Emotion recognition; Logic gates; Machine learning; Measurement; multimodal and incremental learning; Task analysis; Visualization
ISSN	2156-8111
DOI	10.1109/ACII.2019.8925444

Summary:	Audio-Video Emotion Recognition (AVER) aims at a human-level understanding of emotions from video clips. The audio and visual modalities need to be brought into a unified framework to learn multimodal fusion for AVER effectively. In addition, the literature lacks in-depth analysis and utilization of how emotions vary as a function of time. Psychological and neurological studies show that negative and positive emotions are not recognized at the same speed. In this paper, we propose a novel multimodal temporal deep network framework that embeds video clips, using their audio-visual content, onto a metric space where the gap between the modalities is reduced and their complementary and supplementary information is explored. We address two research questions: (1) how audio-visual cues contribute to emotion recognition and (2) how temporal information impacts the recognition rate and speed of emotions. The proposed method is evaluated on two datasets, CREMA-D and RAVDESS. The findings are promising: the method achieves state-of-the-art performance on both datasets and shows a significant impact of multimodal and temporal emotion perception.