Sliced Wasserstein weighted multimodal MambaVision for emotion recognition

Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 327, p. 114182
Main Authors: Wang, Hao; Xu, Li; Ding, Weiyue; Xu, Yiming
Format: Journal Article
Language: English
Published: Elsevier B.V., 09.10.2025

Summary: In the current field of physiological signal-based affective computing, capturing both local and global information from single-modal signals and effectively fusing multimodal signals still face significant challenges. Recently, Mamba-based models have attracted widespread attention due to their linear complexity and exceptional long-sequence modeling capabilities, yet existing Mamba-based models are primarily designed for single-modal tasks. This study introduces the Sliced Wasserstein Weighted Multimodal (SWWM) MambaVision, a novel multimodal fusion model designed to achieve more effective multimodal integration by leveraging the correlations and complementarities of physiological signals. The model inherits the high computational efficiency of Mamba's State Space Model (SSM) and integrates a cross-window connection mechanism to capture the global information of single-modal physiological signals. Furthermore, the study constructs a dual-stream framework to fuse multimodal signals. In addition, a weighting mechanism based on the Sliced-Wasserstein (SW) distance is proposed, which exploits the manifold structure of physiological signals to compute a distance metric between modal feature matrices, achieving more flexible and effective multimodal fusion. The method was validated on the DEAP and DREAMER datasets, achieving average accuracies of 98.99% and 97.58%, respectively. Throughput was improved by 84%, 22%, and 8% compared to Conv-based, Transformer-based, and Conv-Transformer-based multimodal models, respectively. The results demonstrate its performance advantages in multimodal physiological signal processing and open new directions for further research in this field.
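
The summary describes a weighting mechanism that turns the Sliced-Wasserstein (SW) distance between modal feature matrices into fusion weights. The sketch below is a minimal NumPy illustration of that idea, not the authors' implementation: it assumes both modality encoders emit feature matrices of the same shape, uses their average as a reference distribution, and maps the two SW distances to weights with a softmax. The function names (`sw_distance`, `fusion_weights`), the choice of reference, the temperature `tau`, and the example feature shapes are all illustrative assumptions.

```python
import numpy as np

def sw_distance(x, y, n_projections=64, p=2, seed=0):
    """Approximate sliced Wasserstein-p distance between two feature matrices.

    x, y : (n, d) arrays whose rows are treated as samples from each
    distribution (equal sample counts assumed for simplicity).
    """
    rng = np.random.default_rng(seed)
    d = x.shape[1]
    # Random unit directions define the 1-D slices.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)
    # Project both sample sets onto every direction: shape (n, n_projections).
    xp = np.sort(x @ theta.T, axis=0)
    yp = np.sort(y @ theta.T, axis=0)
    # 1-D Wasserstein-p distance is the p-mean gap between sorted projections,
    # averaged over all slices.
    return float((np.abs(xp - yp) ** p).mean() ** (1.0 / p))

def fusion_weights(feat_a, feat_b, tau=1.0):
    """Map SW distances between modal features to a pair of fusion weights.

    Assumption (not taken from the paper): each modality is compared against
    the averaged feature matrix, and the modality closer to this reference
    receives the larger weight via a softmax over negative distances.
    """
    fused = 0.5 * (feat_a + feat_b)            # reference distribution
    d_a = sw_distance(feat_a, fused)
    d_b = sw_distance(feat_b, fused)
    w = np.exp(-np.array([d_a, d_b]) / tau)    # smaller distance -> larger weight
    return w / w.sum()

if __name__ == "__main__":
    # Hypothetical same-dimensional features from two modality encoders,
    # e.g. EEG and a peripheral signal such as ECG.
    rng = np.random.default_rng(1)
    eeg_feats = rng.normal(size=(256, 64))
    ecg_feats = rng.normal(loc=0.3, size=(256, 64))
    print(fusion_weights(eeg_feats, ecg_feats))
```

Under these assumptions the weights sum to one and shift toward the modality whose feature distribution lies closer to the fused reference; the paper's exact weighting rule may differ.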
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2025.114182