MAVEN: A Memory Augmented Recurrent Approach for Multimodal Fusion

Bibliographic Details
Published in	IEEE Transactions on Multimedia, Vol. 25, pp. 3694-3708
Main Authors	Islam, Md Mofijul; Yasar, Mohammad Samin; Iqbal, Tariq
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2023
Subjects	Ablation; Data mining; Datasets; Deep learning; Feature extraction; Fuses; Human activity recognition; Machine learning; Memory; multimodal learning; Noise measurement; Representations; Robustness; Sensors; Task analysis; Visualization
Summary:	Multisensory systems provide complementary information that helps many machine learning approaches perceive the environment comprehensively. These systems consist of heterogeneous modalities with disparate characteristics and feature distributions. Thus, extracting, aligning, and fusing complementary representations from heterogeneous modalities (e.g., visual, skeleton, and physical sensors) remains challenging. To address these challenges, we draw on insights from several neuroscience studies of animal multisensory systems to develop MAVEN, a memory-augmented recurrent approach for multimodal fusion. MAVEN generates unimodal memory banks composed of spatial-temporal features and uses our proposed recurrent representation alignment approach to align and refine unimodal representations iteratively. MAVEN then uses a multimodal variational attention-based fusion approach to produce a robust multimodal representation from the aligned unimodal features. Our extensive experimental evaluations on three multimodal datasets suggest that MAVEN outperforms state-of-the-art multimodal learning approaches on the challenging human activity recognition task across all evaluation conditions (cross-subject, leave-one-subject-out, and cross-session). Additionally, our extensive ablation studies suggest that MAVEN significantly outperforms feed-forward fusion-based learning models ($p < 0.05$). Finally, the robust performance of MAVEN in extracting complementary multimodal representations from occluded and noisy data suggests its applicability to real-world datasets.
ISSN:	1520-9210, 1941-0077
DOI:	10.1109/TMM.2022.3164261
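
Illustrative sketch:	The summary describes a three-stage pipeline: unimodal memory banks, iterative alignment of unimodal representations, and attention-based fusion. The Python sketch below is only a toy illustration of that general pattern, not the authors' implementation. Every function name, the toy feature values, the `steps` and `rate` parameters, and the use of plain dot-product attention are assumptions made for illustration. It omits the variational component and all learned parameters, and it needs at least two modalities.

```python
import math

# Hypothetical toy illustration; none of these names come from the paper.

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def attend(query, memory):
    """Attention-weighted read of a memory bank (a list of feature vectors)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, slot) / scale for slot in memory])
    return [sum(w * slot[i] for w, slot in zip(weights, memory))
            for i in range(len(query))]

def align(features, memories, steps=3, rate=0.5):
    """Iteratively nudge each unimodal feature toward what it reads
    from the other modalities' memory banks (assumed update rule)."""
    names = list(features)
    for _ in range(steps):
        updated = {}
        for name in names:
            reads = [attend(features[name], memories[other])
                     for other in names if other != name]
            context = [sum(col) / len(reads) for col in zip(*reads)]
            updated[name] = [(1 - rate) * f + rate * c
                             for f, c in zip(features[name], context)]
        features = updated
    return features

def fuse(features):
    """Attention-weighted sum of the aligned unimodal features."""
    names = list(features)
    dim = len(features[names[0]])
    mean = [sum(features[n][i] for n in names) / len(names) for i in range(dim)]
    weights = softmax([dot(features[n], mean) / math.sqrt(dim) for n in names])
    fused = [sum(w * features[n][i] for w, n in zip(weights, names))
             for i in range(dim)]
    return fused, dict(zip(names, weights))

# Toy memory banks: two time steps of 3-dimensional features per modality.
memories = {
    "visual":   [[1.0, 0.0, 0.5], [0.8, 0.2, 0.4]],
    "skeleton": [[0.9, 0.1, 0.6], [0.7, 0.3, 0.5]],
    "sensor":   [[0.2, 0.9, 0.1], [0.1, 0.8, 0.3]],
}
# Initial unimodal feature: mean over time of each memory bank.
initial = {name: [sum(col) / len(bank) for col in zip(*bank)]
           for name, bank in memories.items()}

aligned = align(initial, memories)
fused, weights = fuse(aligned)
print("fused:", fused)
print("modality weights:", weights)
```

In this toy setup the fusion weights form a probability distribution over modalities, so a modality whose aligned feature agrees less with the others contributes less to the fused vector. A learned model would replace the fixed dot-product scores with trained projections.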