Multi-modal temporal action segmentation for manufacturing scenarios
| Published in | Engineering applications of artificial intelligence Vol. 148; p. 110320 |
|---|---|
| Main Authors | , , , |
| Format | Journal Article |
| Language | English |
| Published | Elsevier Ltd, 15.05.2025 |
Summary: Industrial robots have become prevalent in manufacturing due to their advantages of accuracy, speed, and reduced operator fatigue. Nevertheless, human operators play a crucial role in primary production lines. This study focuses on the temporal segmentation of human actions, aiming to identify the physical and cognitive behavior of operators working alongside collaborative robots. While existing literature explores temporal action segmentation datasets, there is a lack of evaluation for manufacturing tasks. This work assesses six state-of-the-art action segmentation models using the Human Action Multi-Modal Monitoring in Manufacturing (HA4M) dataset, in which subjects assemble an industrial object in realistic manufacturing scenarios. By employing Cross-Subject and Cross-Location evaluation protocols, the study not only demonstrates the effectiveness of these models in industrial settings but also introduces a new benchmark for evaluating generalization across different subjects and locations. The evaluation further includes new videos recorded in simulated industrial locations, assessed with both fully and semi-supervised learning approaches. The findings reveal that the Multi-Stage Temporal Convolutional Network++ (MS-TCN++) and the Action Segmentation Transformer (ASFormer) architectures perform strongly in both supervised and semi-supervised settings, including on the new data, particularly when trained with Skeletal features, advancing the capabilities of temporal action segmentation in real-world manufacturing environments. This research lays the foundation for addressing video activity understanding challenges in manufacturing and presents opportunities for future investigations.
Highlights:
- I3D and Skeletal features extracted from the HA4M dataset for TAS in manufacturing.
- Different set splits according to subjects and settings to understand feature reliability.
- Fully and semi-supervised learning approaches to assess the behavior of the models.
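The Cross-Subject protocol mentioned in the summary partitions the dataset so that no subject's videos appear in both the training and test sets, which measures generalization to unseen operators. A minimal sketch of such a split follows; the video names and subject IDs are illustrative placeholders, not taken from the HA4M dataset itself.

```python
# Sketch of a Cross-Subject split for temporal action segmentation.
# Assumption: each video is tagged with the ID of the subject performing
# the assembly; names and IDs below are hypothetical examples.

def cross_subject_split(videos, test_subjects):
    """Partition (video, subject) pairs so that no subject appears
    in both the training and the test set."""
    train, test = [], []
    for name, subject in videos:
        (test if subject in test_subjects else train).append(name)
    return train, test

videos = [
    ("assembly_01.mp4", "S1"),
    ("assembly_02.mp4", "S1"),
    ("assembly_03.mp4", "S2"),
    ("assembly_04.mp4", "S3"),
]

train, test = cross_subject_split(videos, test_subjects={"S3"})
# All of S1's and S2's videos land in train; only S3's videos are held out,
# so the model is evaluated on an operator it has never seen.
```

A Cross-Location split works the same way, keyed on the recording location of each video instead of the subject ID.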
ISSN: | 0952-1976 |
DOI: | 10.1016/j.engappai.2025.110320 |