AttA-NET: Attention Aggregation Network for Audio-Visual Emotion Recognition
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 8030-8034
Format: Conference Proceeding
Language: English
Published: IEEE, 14.04.2024
Summary: In video-based emotion recognition, effective multi-modal fusion techniques are essential to exploit the complementary relationship between the audio and visual modalities. Recent attention-based fusion methods are widely used to capture modal-shared properties, but they often ignore the modal-specific properties of the audio and visual modalities and the misalignment of modal-shared emotional semantic features. In this paper, an Attention Aggregation Network (AttA-NET) is proposed to address these challenges. An attention aggregation module is introduced to capture modal-shared properties effectively; it comprises similarity-aware enhancement blocks and a contrastive loss that aligns audio and visual semantic features. Moreover, an auxiliary uni-modal classifier is introduced to obtain modal-specific properties, so that intra-modal discriminative features are fully extracted. Under joint optimization of the uni-modal and multi-modal classification losses, modal-specific information can be infused. Extensive experiments on the RAVDESS and PKU-ER datasets validate the superiority of AttA-NET. The code is available at: https://github.com/NariFan2002/AttA-NET.
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447640
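The abstract describes a joint objective that combines a multi-modal classification loss, auxiliary uni-modal classification losses, and a contrastive loss aligning audio and visual semantic features. The PyTorch sketch below illustrates one plausible form of such an objective; the function name `joint_loss`, the InfoNCE-style contrastive term, and the weights `lambda_uni` and `lambda_con` are illustrative assumptions and not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def joint_loss(mm_logits, audio_logits, visual_logits, labels,
               audio_feat, visual_feat,
               temperature=0.1, lambda_uni=0.5, lambda_con=0.5):
    """Hypothetical joint objective: multi-modal classification,
    auxiliary uni-modal classification, and contrastive alignment."""
    # Multi-modal classification loss on the fused prediction.
    loss_mm = F.cross_entropy(mm_logits, labels)

    # Auxiliary uni-modal losses keep modal-specific, intra-modal
    # discriminative information in each branch.
    loss_uni = (F.cross_entropy(audio_logits, labels)
                + F.cross_entropy(visual_logits, labels))

    # InfoNCE-style contrastive term pulling together audio and visual
    # features of the same sample (an assumed formulation).
    a = F.normalize(audio_feat, dim=-1)
    v = F.normalize(visual_feat, dim=-1)
    sim = a @ v.t() / temperature                        # (B, B) similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives on diagonal
    loss_con = 0.5 * (F.cross_entropy(sim, targets)
                      + F.cross_entropy(sim.t(), targets))

    return loss_mm + lambda_uni * loss_uni + lambda_con * loss_con
```

The relative weights of the auxiliary and contrastive terms are hyperparameters here; the published method may balance or schedule these losses differently.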