Towards Asynchronous Multimodal Signal Interaction and Fusion via Tailored Transformers

Bibliographic Details
Published in IEEE Signal Processing Letters Vol. 31; pp. 1550-1554
Main Authors Yang, Dingkang; Kuang, Haopeng; Yang, Kun; Li, Mingcheng; Zhang, Lihua
Format Journal Article
Language English
Published New York IEEE 2024
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: The signals from human expressions are usually multimodal, including natural language, facial gestures, and acoustic behaviors. A key challenge is how to fuse multimodal time-series signals with temporal asynchrony. To this end, we present a Transformer-driven Signal Interaction and Fusion (TSIF) approach to effectively model asynchronous multimodal signal sequences. TSIF consists of linear and cross-modal transformer modules with different duties. The linear transformer module efficiently performs the global interaction for multimodal signals, and the vital philosophy is to replace the dot product similarity with the Exponential Kernel while achieving linear complexity by a low-rank matrix decomposition. By targeting the language modality, the cross-modal transformer module aims to capture reliable element correlations among distinct signals and mitigate noise interference in audio and visual modalities. Numerous experiments on two multimodal benchmarks show that our TSIF comparably outperforms previous state-of-the-art models with lower space-time complexities. The systematic analysis also proves the effectiveness of the proposed modules.
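The summary's idea of an exponential kernel made linear through a low-rank factorization can be illustrated with a minimal NumPy sketch. This is not the TSIF implementation: the summary does not specify the decomposition, so the sketch swaps in the well-known positive random-feature factorization of the exponential kernel as a stand-in. All names (`exp_kernel_features`, `linear_attention`, `proj`, `language`, `audio`) are hypothetical and chosen only for illustration.

```python
import numpy as np


def exp_kernel_features(x, proj):
    """Map x (n, d) to positive features (n, m) whose inner products
    approximate exp(x_i . y_j). proj is (m, d) with standard normal rows.
    Assumption: random-feature factorization, not the paper's own scheme."""
    m = proj.shape[0]
    half_sq_norm = 0.5 * np.sum(x ** 2, axis=-1, keepdims=True)
    return np.exp(x @ proj.T - half_sq_norm) / np.sqrt(m)


def linear_attention(q, k, v, proj):
    """Kernelized attention that never forms the (n_q, n_k) score matrix.
    Cost grows linearly with the sequence lengths n_q and n_k."""
    scale = q.shape[-1] ** -0.25              # splits the usual 1/sqrt(d) across q and k
    phi_q = exp_kernel_features(q * scale, proj)   # (n_q, m)
    phi_k = exp_kernel_features(k * scale, proj)   # (n_k, m)
    kv = phi_k.T @ v                          # (m, d_v): keys and values summarized once
    normalizer = phi_q @ phi_k.sum(axis=0)    # (n_q,): row sums of the implicit kernel
    return (phi_q @ kv) / normalizer[:, None]


def softmax_attention(q, k, v):
    """Quadratic-cost reference: standard scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v


# Hypothetical unaligned sequences: 20 language steps attend over 50 audio steps.
rng = np.random.default_rng(0)
language = 0.1 * rng.standard_normal((20, 8))
audio = 0.1 * rng.standard_normal((50, 8))
audio_values = rng.standard_normal((50, 4))
proj = rng.standard_normal((256, 8))
fused = linear_attention(language, audio, audio_values, proj)
```

Because the two sequences only meet through the `(m, d_v)` summary `kv`, their lengths need not match, which is one way unaligned (asynchronous) modalities can interact without explicit alignment.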
ISSN: 1070-9908, 1558-2361
DOI: 10.1109/LSP.2024.3409211