Contextual xLSTM-based multimodal fusion for conversational emotion recognition
Published in: Pattern Analysis and Applications (PAA), Vol. 28, No. 3
Main Authors: , ,
Format: Journal Article
Language: English
Published: London: Springer London; Springer Nature B.V., 01.09.2025
Summary: In real-world dialogue systems, the ability to understand user emotions and engage in human-like interactions is of great significance. Emotion Recognition in Conversation (ERC) is one of the key approaches to achieving this goal and has attracted increasing attention. However, existing ERC methods often fail to model contextual information effectively or to exploit the complementarity of multimodal information, and few fully capture the complex correlations and mapping relationships between modalities. In addition, classifying minority classes and semantically similar classes remains a significant challenge. To address these issues, this paper proposes an attention-based contextual modeling and multimodal fusion network. The proposed method combines an extended LSTM (xLSTM) network with an attention mechanism to thoroughly model the contextual information generated during conversations. xLSTM is an enhanced LSTM unit featuring matrix memory and exponential gating, which better captures long-range dependencies and improves recognition performance. Furthermore, a Transformer-based modality encoder maps features from different modalities into a shared feature space, enabling alignment and mutual enhancement among modalities; this facilitates both intra-modal and inter-modal interaction and maximizes the complementary advantages of multimodal data. A multimodal fusion module based on bidirectional multi-head cross-attention layers then captures cross-modal mapping relationships among the text, audio, and visual modalities, effectively integrating multimodal information. Extensive experiments on two benchmark ERC datasets, IEMOCAP and MELD, show that the proposed method achieves weighted F1 scores of 75.21 and 69.78, respectively, outperforming current state-of-the-art methods by 2.2 and 3.3 points. It also achieves accuracy rates of 80.59% and 69.16% on the two datasets, improvements of 6.64% and 1.11%, respectively. The code and models are available at: https://github.com/rhoqwomda/MFCRE
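The abstract describes the fusion module only at a high level. As a rough illustration, the sketch below shows one plausible way to realize bidirectional multi-head cross-attention over text, audio, and visual utterance features in PyTorch. The class name `BiCrossAttentionFusion`, the six pairwise attention directions, and the concatenation-plus-linear readout are assumptions made for illustration, not the released MFCRE implementation.

```python
# Minimal sketch (not the authors' code): pairwise bidirectional multi-head
# cross-attention fusion over text/audio/visual utterance features, assuming
# all three modalities have already been projected to a shared dimension d.
import torch
import torch.nn as nn


class BiCrossAttentionFusion(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 4):
        super().__init__()
        # One attention layer per direction of each modality pair
        # (text<->audio, text<->visual, audio<->visual).
        self.attn = nn.ModuleDict({
            name: nn.MultiheadAttention(d_model, n_heads, batch_first=True)
            for name in ["t2a", "a2t", "t2v", "v2t", "a2v", "v2a"]
        })
        self.norm = nn.LayerNorm(d_model)
        self.out = nn.Linear(6 * d_model, d_model)

    def cross(self, name, query, key_value):
        # Query one modality with another; residual connection + layer norm.
        attended, _ = self.attn[name](query, key_value, key_value)
        return self.norm(query + attended)

    def forward(self, text, audio, visual):
        # Each input: (batch, seq_len, d_model) utterance-level features.
        fused = torch.cat([
            self.cross("t2a", text, audio),
            self.cross("a2t", audio, text),
            self.cross("t2v", text, visual),
            self.cross("v2t", visual, text),
            self.cross("a2v", audio, visual),
            self.cross("v2a", visual, audio),
        ], dim=-1)
        return self.out(fused)  # (batch, seq_len, d_model) fused representation


if __name__ == "__main__":
    b, n, d = 2, 10, 256
    fusion = BiCrossAttentionFusion(d_model=d)
    t, a, v = (torch.randn(b, n, d) for _ in range(3))
    print(fusion(t, a, v).shape)  # torch.Size([2, 10, 256])
```

In this reading of "bidirectional", each modality both attends to and is attended by the other two; the released code at the repository above should be treated as authoritative for the actual architecture.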
ISSN: 1433-7541, 1433-755X
DOI: 10.1007/s10044-025-01508-8