Cross-attention-based hybrid ViT-CNN fusion network for action recognition in visible and infrared videos


Bibliographic Details
Published in: Pattern Analysis and Applications (PAA), Vol. 28, No. 3
Main Authors: Imran, Javed; Gupta, Himanshu
Format: Journal Article
Language: English
Published: Heidelberg: Springer Nature B.V., 01.09.2025

Summary: Human action recognition (HAR) in videos is a critical task in computer vision, but traditional methods relying solely on visible (RGB) data face challenges in low-light or occluded scenarios. Infrared (IR) imagery offers robustness in such conditions, yet effectively fusing IR and visible modalities remains an open problem. To address this, we propose HVCCA-Net, a Hybrid ViT-CNN Cross-Attention Network that integrates the strengths of both modalities. Our framework consists of three key modules: (1) a video pre-processing (VPP) module that extracts IR/visible frames, stacked dense flow, and residual images; (2) an intra-modality spatio-temporal feature learning (ISTFL) module combining Inflated 3D CNN (I3D), Group Propagation Vision Transformer (GPViT), and Bi-directional Long Short-Term Memory (BiLSTM) to capture local and global features; and (3) a cross-modality multi-head attention fusion (CMHAF) module that dynamically aligns and fuses complementary features. Experiments on the Infrared-Visible dataset demonstrate state-of-the-art performance (96.0% accuracy), outperforming existing methods. The results highlight the effectiveness of our cross-attention mechanism in leveraging multimodal data for robust action recognition. The code and datasets of the proposed method are available at https://github.com/jvdgit/IR-Vis-Action-Recognition.git
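The cross-modality fusion idea described in the summary can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration of bidirectional multi-head cross-attention between visible and infrared feature streams; it is not the authors' CMHAF implementation (see the linked repository for that), and the module name, feature dimensions, and fusion-by-concatenation choice are all assumptions made for clarity.

    import torch
    import torch.nn as nn

    class CrossModalAttentionFusion(nn.Module):
        """Hypothetical sketch of cross-modality multi-head attention fusion.

        Each modality's tokens attend over the other modality's tokens
        (RGB->IR and IR->RGB), and the enriched streams are concatenated.
        Not the paper's CMHAF module, only the general mechanism.
        """

        def __init__(self, dim: int = 512, num_heads: int = 8):
            super().__init__()
            # One attention block per direction of cross-attention.
            self.attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.attn_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_rgb = nn.LayerNorm(dim)
            self.norm_ir = nn.LayerNorm(dim)

        def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
            # rgb, ir: (batch, tokens, dim) feature sequences from the two streams.
            # Visible queries attend over infrared keys/values, and vice versa,
            # so each modality is augmented with complementary cues.
            rgb_fused, _ = self.attn_rgb(query=rgb, key=ir, value=ir)
            ir_fused, _ = self.attn_ir(query=ir, key=rgb, value=rgb)
            rgb_out = self.norm_rgb(rgb + rgb_fused)  # residual connection + norm
            ir_out = self.norm_ir(ir + ir_fused)
            # Concatenate along the channel axis for a downstream classifier head.
            return torch.cat([rgb_out, ir_out], dim=-1)

    # Usage: fuse token sequences of assumed size from each modality.
    fusion = CrossModalAttentionFusion(dim=512, num_heads=8)
    rgb_feats = torch.randn(2, 196, 512)
    ir_feats = torch.randn(2, 196, 512)
    fused = fusion(rgb_feats, ir_feats)  # shape: (2, 196, 1024)

The bidirectional design reflects the summary's claim that fusion should dynamically align complementary features from both modalities rather than privileging one stream.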
ISSN: 1433-7541
EISSN: 1433-755X
DOI: 10.1007/s10044-025-01493-y