Towards Efficient Construction Monitoring: An Empirical Study on Action Recognition Models

Bibliographic Details
Published in	ISARC. Proceedings of the International Symposium on Automation and Robotics in Construction Vol. 41; pp. 1104 - 1114
Main Authors	Nanduri, Sudheer Kumar; Delhi, Venkata Kumar
Format	Conference Proceeding
Language	English
Published	Waterloo IAARC Publications 01.01.2024
Subjects	Activity recognition; Artificial neural networks; Computer vision; Context; Datasets; Feature recognition; Masonry construction; Monitoring; Motion perception; Pattern recognition; Visual tasks
Summary:	Monitoring fatigue with computer-vision-based action recognition is challenging because fatigue changes motion patterns. In construction in particular, motion patterns are unique to each trade and last longer than daily-life actions, which makes recognition harder. This paper aims to understand the patterns that can guide the selection of optimal clip durations for aggregating motion features specific to each task. We compare the performance of three action recognition models (I3D, MViT, and VideoMAE) on different construction tasks (excavation, masonry, plastering, etc.) at varying clip lengths. We evaluate the models on frame-wise accuracy, sequence predictability error, and normalized evaluation duration. Our results show that the transformer-based models outperform the convolutional neural network-based models, and that the model trained directly on videos performs better than those trained on images. The effect of clip duration on model performance also differs by task type. Neither the 3s context window from the Atomic Visual Actions (AVA) dataset nor the 10s context window from the Kinetics-400 dataset is suitable for construction tasks. Instead, we suggest a variable clip duration between 5s and 7s, chosen according to the task and model architecture. Our work provides insights for developing a dynamic and context-aware duration selection system for action recognition in construction.