AttA-NET: Attention Aggregation Network for Audio-Visual Emotion Recognition

Bibliographic Details
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8030-8034
Main Authors: Fan, Ruijia; Liu, Hong; Li, Yidi; Guo, Peini; Wang, Guoquan; Wang, Ti
Format: Conference Proceeding
Language: English
Published: IEEE, 14.04.2024

Summary: In video-based emotion recognition, effective multi-modal fusion techniques are essential to leverage the complementary relationship between audio and visual modalities. Recent attention-based fusion methods are widely used to capture modal-shared properties, but they often ignore the modal-specific properties of the audio and visual modalities and the misalignment of modal-shared emotional semantic features. In this paper, an Attention Aggregation Network (AttA-NET) is proposed to address these challenges. An attention aggregation module is introduced to capture modal-shared properties effectively; it comprises similarity-aware enhancement blocks and a contrastive loss that aligns audio and visual semantic features. Moreover, an auxiliary uni-modal classifier is introduced to obtain modal-specific properties by fully extracting intra-modal discriminative features. Under joint optimization of the uni-modal and multi-modal classification losses, modal-specific information can be infused. Extensive experiments on the RAVDESS and PKU-ER datasets validate the superiority of AttA-NET. The code is available at: https://github.com/NariFan2002/AttA-NET.
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447640
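As a rough illustration of the training objective described in the summary (joint uni-modal and multi-modal classification losses plus a contrastive term that aligns audio and visual semantic features), below is a minimal PyTorch-style sketch. The function names, loss weights, and the InfoNCE-style form of the contrastive term are assumptions for illustration, not the authors' implementation; the actual code is in the linked repository.

```python
# Hypothetical sketch of the joint objective summarized in the abstract:
# multi-modal classification loss + auxiliary uni-modal classification
# losses + a contrastive alignment term between audio and visual features.
# All names and weights here are illustrative placeholders.
import torch
import torch.nn.functional as F

def contrastive_alignment(audio_feat, visual_feat, temperature=0.1):
    """Symmetric InfoNCE-style loss that pulls paired audio/visual features
    toward a shared semantic space (a common choice; the paper's exact
    formulation may differ)."""
    a = F.normalize(audio_feat, dim=-1)
    v = F.normalize(visual_feat, dim=-1)
    logits = a @ v.t() / temperature                      # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

def joint_loss(mm_logits, a_logits, v_logits, a_feat, v_feat, labels,
               w_uni=0.5, w_con=0.1):
    """Joint optimization of the multi-modal classifier, the auxiliary
    uni-modal classifiers, and the contrastive alignment term
    (weights are placeholders)."""
    loss_mm = F.cross_entropy(mm_logits, labels)                      # fused prediction
    loss_uni = (F.cross_entropy(a_logits, labels) +
                F.cross_entropy(v_logits, labels))                    # auxiliary uni-modal heads
    loss_con = contrastive_alignment(a_feat, v_feat)                  # semantic alignment
    return loss_mm + w_uni * loss_uni + w_con * loss_con
```

The auxiliary uni-modal terms keep each encoder's modal-specific, intra-modal discriminative features, while the contrastive term addresses the misalignment of modal-shared emotional semantics between audio and visual streams.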