Cross-attention-based hybrid ViT-CNN fusion network for action recognition in visible and infrared videos


Bibliographic Details
Published in: Pattern Analysis and Applications (PAA), Vol. 28, No. 3
Main Authors: Imran, Javed; Gupta, Himanshu
Format: Journal Article
Language: English
Published: Heidelberg: Springer Nature B.V., 01.09.2025

Summary: Human action recognition (HAR) in videos is a critical task in computer vision, but traditional methods relying solely on visible (RGB) data face challenges in low-light or occluded scenarios. Infrared (IR) imagery offers robustness in such conditions, yet effectively fusing IR and visible modalities remains an open problem. To address this, we propose HVCCA-Net, a Hybrid ViT-CNN Cross-Attention Network that integrates the strengths of both modalities. Our framework consists of three key modules: (1) a video pre-processing (VPP) module that extracts IR/visible frames, stacked dense flow, and residual images; (2) an intra-modality spatio-temporal feature learning (ISTFL) module combining Inflated 3D CNN (I3D), Group Propagation Vision Transformer (GPViT), and Bi-directional Long Short-Term Memory (BiLSTM) to capture local and global features; and (3) a cross-modality multi-head attention fusion (CMHAF) module that dynamically aligns and fuses complementary features. Experiments on the Infrared-Visible dataset demonstrate state-of-the-art performance (96.0% accuracy), outperforming existing methods. The results highlight the effectiveness of our cross-attention mechanism in leveraging multimodal data for robust action recognition. The code and datasets of the proposed method are available at https://github.com/jvdgit/IR-Vis-Action-Recognition.git
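The cross-modality fusion idea described in the summary can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration of bidirectional multi-head cross-attention between visible and infrared feature streams; it is not the authors' CMHAF implementation (see the linked repository for that), and the module name, feature dimensions, and fusion-by-concatenation choice are all assumptions made for clarity.

    import torch
    import torch.nn as nn

    class CrossModalAttentionFusion(nn.Module):
        """Hypothetical sketch of cross-modality multi-head attention fusion.

        Each modality's tokens attend over the other modality's tokens
        (RGB->IR and IR->RGB), and the enriched streams are concatenated.
        Not the paper's CMHAF module, only the general mechanism.
        """

        def __init__(self, dim: int = 512, num_heads: int = 8):
            super().__init__()
            # One attention block per direction of cross-attention.
            self.attn_rgb = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.attn_ir = nn.MultiheadAttention(dim, num_heads, batch_first=True)
            self.norm_rgb = nn.LayerNorm(dim)
            self.norm_ir = nn.LayerNorm(dim)

        def forward(self, rgb: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
            # rgb, ir: (batch, tokens, dim) feature sequences from the two streams.
            # Visible queries attend over infrared keys/values, and vice versa,
            # so each modality is augmented with complementary cues.
            rgb_fused, _ = self.attn_rgb(query=rgb, key=ir, value=ir)
            ir_fused, _ = self.attn_ir(query=ir, key=rgb, value=rgb)
            rgb_out = self.norm_rgb(rgb + rgb_fused)  # residual connection + norm
            ir_out = self.norm_ir(ir + ir_fused)
            # Concatenate along the channel axis for a downstream classifier head.
            return torch.cat([rgb_out, ir_out], dim=-1)

    # Usage: fuse token sequences of assumed size from each modality.
    fusion = CrossModalAttentionFusion(dim=512, num_heads=8)
    rgb_feats = torch.randn(2, 196, 512)
    ir_feats = torch.randn(2, 196, 512)
    fused = fusion(rgb_feats, ir_feats)  # shape: (2, 196, 1024)

The bidirectional design reflects the summary's claim that fusion should dynamically align complementary features from both modalities rather than privileging one stream.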
ISSN: 1433-7541
EISSN: 1433-755X
DOI: 10.1007/s10044-025-01493-y