Sliced Wasserstein weighted multimodal MambaVision for emotion recognition
Published in: Knowledge-Based Systems, Vol. 327, p. 114182
Format: Journal Article
Language: English
Published: Elsevier B.V., 09.10.2025
Summary: In physiological signal-based affective computing, capturing both local and global information from single-modal signals and effectively fusing multimodal signals remain significant challenges. Recently, Mamba-based models have attracted widespread attention for their linear complexity and strong long-sequence modeling capabilities, yet existing Mamba-based models are designed primarily for single-modal tasks. This study introduces Sliced Wasserstein Weighted Multimodal (SWWM) MambaVision, a novel multimodal fusion model that achieves more effective multimodal integration by exploiting the correlations and complementarities among physiological signals. The model inherits the high computational efficiency of Mamba's State Space Model (SSM) and integrates a cross-window connection mechanism to capture global information from single-modal physiological signals. Furthermore, the study constructs a dual-stream framework to fuse multimodal signals. In addition, a weighting mechanism based on the Sliced-Wasserstein (SW) distance is proposed, which exploits the manifold structure of physiological signals to compute distances between modal feature matrices, enabling more flexible and effective multimodal fusion. The method was validated on the DEAP and DREAMER datasets, achieving average accuracies of 98.99 % and 97.58 %, respectively. Throughput improved by 84 %, 22 %, and 8 % over Conv-based, Transformer-based, and Conv-Transformer-based multimodal models, respectively. These results demonstrate the model's advantages in multimodal physiological signal processing and open new directions for further research in this field.
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2025.114182
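The summary above describes a weighting mechanism that derives multimodal fusion weights from Sliced-Wasserstein distances between modal feature matrices. The paper's actual formulation is not reproduced in this record, so the snippet below is only a minimal illustrative sketch of that general idea, not the authors' implementation: it assumes each modality yields an (n_tokens, d) feature matrix with equal token counts, approximates the SW distance with random 1-D projections, and turns the per-modality mean distances into weights via a softmax heuristic. The function names (`sliced_wasserstein`, `sw_fusion_weights`) and the softmax-over-negative-distance rule are illustrative assumptions.

```python
# Illustrative sketch of SW-distance-based fusion weighting (assumptions noted above).
import numpy as np


def sliced_wasserstein(X, Y, n_projections=64, seed=0):
    """Approximate the Sliced-Wasserstein distance between two feature matrices.

    X, Y: (n, d) arrays whose rows are feature vectors; n must match here.
    The distance averages 1-D Wasserstein-2 distances over random projections.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    # Random projection directions on the unit sphere in R^d.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    proj_x = np.sort(X @ theta.T, axis=0)  # (n, n_projections), sorted per slice
    proj_y = np.sort(Y @ theta.T, axis=0)
    return np.sqrt(np.mean((proj_x - proj_y) ** 2))


def sw_fusion_weights(feature_mats, temperature=1.0):
    """Heuristic: modalities closer (in SW distance) to the others get larger weights."""
    k = len(feature_mats)
    mean_dist = np.zeros(k)
    for i in range(k):
        dists = [sliced_wasserstein(feature_mats[i], feature_mats[j])
                 for j in range(k) if j != i]
        mean_dist[i] = np.mean(dists)
    logits = -mean_dist / temperature
    w = np.exp(logits - logits.max())
    return w / w.sum()


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Hypothetical per-modality token features (e.g. EEG and two peripheral channels).
    feats = [rng.normal(loc=mu, size=(128, 64)) for mu in (0.0, 0.3, 1.0)]
    w = sw_fusion_weights(feats)
    fused = sum(wi * f for wi, f in zip(w, feats))  # weighted feature fusion
    print("fusion weights:", w, "fused shape:", fused.shape)
```

Usage note: with this heuristic, an outlying modality (large mean SW distance to the others) is down-weighted in the fused representation; how SWWM MambaVision actually maps distances to weights, and where the weighting is applied in the dual-stream network, is detailed only in the full paper.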