Contextual visual and motion salient fusion framework for action recognition in dark environments

Bibliographic Details
Published in: Knowledge-Based Systems, Vol. 304, p. 112480
Main Authors: Munsif, Muhammad; Khan, Samee Ullah; Khan, Noman; Hussain, Altaf; Kim, Min Je; Baik, Sung Wook
Format: Journal Article
Language: English
Published: Elsevier B.V., 25.11.2024

Summary: Infrared (IR) human action recognition (AR) exhibits resilience against shifting illumination conditions, changes in appearance, and shadows. It has valuable applications in numerous areas of future sustainable and smart cities, including robotics, intelligent systems, security, and transportation. However, current IR-based recognition approaches predominantly concentrate on spatial or local temporal information and often overlook the potential value of global temporal patterns. This oversight can lead to incomplete representations of body-part movements and prevent accurate optimization of a network. Therefore, a contextual-motion coalescence network (CMCNet) is proposed that operates in a streamlined, end-to-end manner for robust action representation in darkness in a near-infrared (NIR) setting. Initially, the data are preprocessed: the foreground is separated, and frames are normalized and resized. The framework employs two parallel modules: the contextual visual features learning module (CVFLM) for local feature extraction and the temporal optical flow learning module (TOFLM) for acquiring motion dynamics. These modules focus on action-relevant regions using shift-window-based operations to ensure accurate interpretation of motion information. The coalescence block harmoniously integrates the contextual and motion features within a unified framework. Finally, the temporal decoder module discriminatively identifies the boundaries of the action sequence. This sequence of steps ensures the synergistic optimization of both CVFLM and TOFLM and, through competent feature extraction, precise AR. Evaluations of CMCNet are carried out on the publicly available InfAR and NTU RGB+D datasets, where superior performance is achieved. Our model yields the highest average precision of 89% and 85% on these datasets, respectively, an improvement of 2.25% on InfAR over conventional methods operating at the spatial and optical-flow levels, which underscores its efficacy.
ISSN: 0950-7051
DOI: 10.1016/j.knosys.2024.112480
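
The abstract describes a two-branch design: a contextual (appearance) branch and a motion (optical-flow) branch whose features are coalesced and then decoded over time. The record does not include code; the following is only a minimal illustrative sketch, assuming PyTorch, of that general two-branch fusion idea. All class names, layer choices, and dimensions here are assumptions for illustration and are not the authors' CMCNet implementation.

```python
# Minimal sketch (assumption, not the authors' code) of a two-branch
# contextual/motion fusion network in the spirit described by the abstract:
# one branch encodes per-frame IR appearance features, the other encodes
# optical-flow features; a fusion layer merges them and a temporal decoder
# produces per-clip action scores.
import torch
import torch.nn as nn


class BranchEncoder(nn.Module):
    """Tiny 2D-conv encoder applied frame-wise to one input stream."""
    def __init__(self, in_ch: int, feat_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, feat_dim),
        )

    def forward(self, x):  # x: (B, T, C, H, W)
        b, t, c, h, w = x.shape
        f = self.net(x.reshape(b * t, c, h, w))       # (B*T, feat_dim)
        return f.reshape(b, t, -1)                    # (B, T, feat_dim)


class TwoBranchFusionNet(nn.Module):
    """Contextual + motion branches -> fusion -> temporal decoding -> scores."""
    def __init__(self, num_classes: int, feat_dim: int = 128):
        super().__init__()
        self.contextual = BranchEncoder(in_ch=1, feat_dim=feat_dim)   # IR frames
        self.motion = BranchEncoder(in_ch=2, feat_dim=feat_dim)       # flow (u, v)
        self.fuse = nn.Linear(2 * feat_dim, feat_dim)                 # fusion step
        self.temporal = nn.GRU(feat_dim, feat_dim, batch_first=True)  # temporal decoder
        self.head = nn.Linear(feat_dim, num_classes)

    def forward(self, frames, flow):
        ctx = self.contextual(frames)                  # (B, T, D)
        mot = self.motion(flow)                        # (B, T, D)
        fused = torch.relu(self.fuse(torch.cat([ctx, mot], dim=-1)))
        seq, _ = self.temporal(fused)                  # (B, T, D)
        return self.head(seq.mean(dim=1))              # (B, num_classes)


if __name__ == "__main__":
    model = TwoBranchFusionNet(num_classes=12)
    frames = torch.randn(2, 16, 1, 112, 112)           # grayscale IR clip
    flow = torch.randn(2, 16, 2, 112, 112)              # precomputed optical flow
    print(model(frames, flow).shape)                     # torch.Size([2, 12])
```

The sketch uses a GRU and mean pooling as stand-ins for the paper's shift-window operations, coalescence block, and temporal decoder, purely to show how the two feature streams could be combined and decoded over time.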