Contextual xLSTM-based multimodal fusion for conversational emotion recognition
Published in: Pattern Analysis and Applications (PAA), Vol. 28, No. 3
Main Authors: , ,
Format: Journal Article
Language: English
Published: London: Springer London; Springer Nature B.V., 01.09.2025
Summary: In real-world dialogue systems, the ability to understand user emotions and engage in human-like interactions is of great significance. Emotion Recognition in Conversation (ERC) is one of the key approaches to achieving this goal and has attracted increasing attention. However, existing ERC methods often fail to model contextual information effectively or to exploit the complementarity of multimodal information, and few fully capture the complex correlations and mapping relationships between modalities. In addition, classifying minority classes and semantically similar classes remains a significant challenge. To address these issues, this paper proposes an attention-based contextual modeling and multimodal fusion network. The proposed method combines an extended LSTM (xLSTM) network with an attention mechanism to thoroughly model the contextual information generated during conversations. xLSTM is an enhanced LSTM unit featuring matrix memory and exponential gating, which better captures long-range dependencies and improves recognition performance. Furthermore, a Transformer-based modality encoder maps features from different modalities into a shared feature space, enabling alignment and mutual enhancement among modalities; this facilitates both intra-modal and inter-modal interaction and maximizes the complementary advantages of multimodal data. A multimodal fusion module based on bidirectional multi-head cross-attention layers then captures cross-modal mapping relationships among the text, audio, and visual modalities, effectively integrating multimodal information. Extensive experiments on two benchmark ERC datasets, IEMOCAP and MELD, show that the proposed method achieves weighted F1 scores of 75.21 and 69.78, respectively, outperforming current state-of-the-art methods by 2.2 and 3.3 points. It also achieves accuracy rates of 80.59% and 69.16% on the two datasets, improvements of 6.64% and 1.11%, respectively. The code and models are available at: https://github.com/rhoqwomda/MFCRE
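The abstract describes the fusion module only at a high level. As a rough illustration, the sketch below shows one plausible way to realize bidirectional multi-head cross-attention over text, audio, and visual utterance features in PyTorch. The class name `BiCrossAttentionFusion`, the six pairwise attention directions, and the concatenation-plus-linear readout are assumptions made for illustration, not the released MFCRE implementation.

```python
# Minimal sketch (not the authors' code): pairwise bidirectional multi-head
# cross-attention fusion over text/audio/visual utterance features, assuming
# all three modalities have already been projected to a shared dimension d.
import torch
import torch.nn as nn


class BiCrossAttentionFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # One attention layer per direction of each modality pair
        # (text<->audio, text<->visual, audio<->visual).
        self.attn = nn.ModuleDict({
            name: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for name in ["t2a", "a2t", "t2v", "v2t", "a2v", "v2a"]
        })
        self.norm = nn.LayerNorm(d_model)
        self.out = nn.Linear(6 * d_model, d_model)

    def cross(self, name, query, key_value):
        # Query one modality with another; residual connection + layer norm.
        attended, _ = self.attn[name](query, key_value, key_value)
        return self.norm(query + attended)

    def forward(self, text, audio, visual):
        # Each input: (batch, seq_len, d_model) utterance-level features.
        fused = torch.cat([
            self.cross("t2a", text, audio),
            self.cross("a2t", audio, text),
            self.cross("t2v", text, visual),
            self.cross("v2t", visual, text),
            self.cross("a2v", audio, visual),
            self.cross("v2a", visual, audio),
        ], dim=-1)
        return self.out(fused)  # (batch, seq_len, d_model) fused representation


if __name__ == "__main__":
    b, n, d = 2, 10, 256
    fusion = BiCrossAttentionFusion(d_model=d)
    t, a, v = (torch.randn(b, n, d) for _ in range(3))
    print(fusion(t, a, v).shape)  # torch.Size([2, 10, 256])
```

In this reading of "bidirectional", each modality both attends to and is attended by the other two; the released code at the repository above should be treated as authoritative for the actual architecture.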
ISSN: 1433-7541, 1433-755X
DOI: 10.1007/s10044-025-01508-8