Semantic-Aware Frame-Event Fusion based Pattern Recognition via Large Vision-Language Models

Format | Journal Article
---|---
Language | English
Published | 30.11.2023

Summary: Pattern recognition through the fusion of RGB frames and event streams
has emerged as a novel research area in recent years. Current methods typically
employ backbone networks to extract the features of RGB frames and event
streams separately, and subsequently fuse these features for pattern
recognition. However, we posit that these methods may suffer from key issues
such as semantic gaps and small-scale backbone networks. In this study, we
introduce a novel pattern recognition framework that consolidates semantic
labels, RGB frames, and event streams by leveraging pre-trained large-scale
vision-language models. Specifically, given the input RGB frames, event
streams, and all the predefined semantic labels, we employ a pre-trained
large-scale vision model (the CLIP vision encoder) to extract the RGB and event
features. To handle the semantic labels, we first convert them into language
descriptions through prompt engineering, and then obtain the semantic features
using a pre-trained large-scale language model (the CLIP text encoder).
Subsequently, we integrate the RGB/event features and semantic features using
multimodal Transformer networks. The resulting frame and event tokens are
further enhanced using self-attention layers. Concurrently, we propose to
strengthen the interactions between text tokens and RGB/event tokens via
cross-attention. Finally, we consolidate all three modalities using
self-attention and feed-forward layers for recognition. Comprehensive
experiments on the HARDVS and PokerEvent datasets fully substantiate the
efficacy of our proposed SAFE model. The source code will be made available at
https://github.com/Event-AHU/SAFE_LargeVLM. Illustrative sketches of the
feature-extraction, prompting, and fusion stages are given below.
DOI: 10.48550/arxiv.2311.18592
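
The abstract's first stage passes both RGB frames and event data through a
shared pre-trained CLIP vision encoder. The following is a minimal sketch of
that dual-branch extraction, not the SAFE implementation: it assumes the
openai/CLIP package, a ViT-B/16 backbone (the paper does not specify the
variant here), and that the event stream has already been accumulated into
image-like event frames, one common event representation.

```python
# Minimal sketch: shared CLIP vision encoder for RGB frames and event frames.
# Assumes: pip install git+https://github.com/openai/CLIP.git
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)  # backbone choice is an assumption

def encode_views(rgb_frames, event_frames):
    """Encode RGB and event frames with the same CLIP vision encoder.

    rgb_frames, event_frames: lists of PIL.Image objects; event frames are
    assumed to be precomputed 2D renderings of the raw event stream.
    Returns two tensors of shape (num_frames, 512) for ViT-B/16.
    """
    rgb_batch = torch.stack([preprocess(im) for im in rgb_frames]).to(device)
    evt_batch = torch.stack([preprocess(im) for im in event_frames]).to(device)
    with torch.no_grad():
        rgb_feats = model.encode_image(rgb_batch)  # (T, 512)
        evt_feats = model.encode_image(evt_batch)  # (T, 512)
    return rgb_feats, evt_feats
```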
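The semantic-label branch converts each predefined class name into a language
description via prompt engineering and encodes it with the CLIP text encoder.
A minimal sketch, again assuming the openai/CLIP package; the prompt template
and example labels below are illustrative, not taken from the paper.

```python
# Minimal sketch: prompt engineering + CLIP text encoder for semantic labels.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/16", device=device)

labels = ["drinking water", "riding a bicycle", "waving hands"]  # example labels only
prompts = [f"a video of a person {lbl}" for lbl in labels]       # illustrative template

tokens = clip.tokenize(prompts).to(device)  # (num_labels, 77) token ids
with torch.no_grad():
    text_feats = model.encode_text(tokens)  # (num_labels, 512) semantic features
```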
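The fusion stage is described as multimodal Transformer processing:
self-attention enhances the frame/event tokens, cross-attention couples
RGB/event tokens with the text tokens, and a final self-attention plus
feed-forward stack consolidates all three modalities for recognition. The
sketch below mirrors that description with standard PyTorch layers; the
dimensions, single-layer depth, mean pooling, and classifier head are all
assumptions (e.g., 300 classes as in HARDVS), not the paper's exact
configuration.

```python
# Minimal sketch of the fusion head described in the abstract.
import torch
import torch.nn as nn

class FusionHead(nn.Module):
    def __init__(self, dim=512, heads=8, num_classes=300):
        super().__init__()
        # Self-attention to enhance the concatenated frame/event tokens.
        self.self_attn = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # Cross-attention: vision tokens query the text tokens.
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Joint self-attention + feed-forward over all three modalities.
        self.joint = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, rgb_tok, evt_tok, txt_tok):
        # rgb_tok, evt_tok: (B, T, D); txt_tok: (B, L, D)
        vis = self.self_attn(torch.cat([rgb_tok, evt_tok], dim=1))
        vis, _ = self.cross_attn(query=vis, key=txt_tok, value=txt_tok)
        fused = self.joint(torch.cat([vis, txt_tok], dim=1))
        return self.classifier(fused.mean(dim=1))  # (B, num_classes)

# Example shapes: 2 clips, 8 RGB + 8 event tokens, 300 label tokens.
head = FusionHead()
logits = head(torch.randn(2, 8, 512), torch.randn(2, 8, 512),
              torch.randn(2, 300, 512))  # -> (2, 300)
```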