TSRN: two-stage refinement network for temporal action segmentation
| Published in | Pattern Analysis and Applications (PAA), Vol. 26, No. 3, pp. 1375–1393 |
| --- | --- |
| Main Authors | |
| Format | Journal Article |
| Language | English |
| Published | London: Springer London, 01.08.2023 (Springer Nature B.V.) |
Summary: In high-level video semantic understanding, continuous action segmentation is a challenging task aimed at segmenting an untrimmed video and labeling each segment with predefined labels over time. However, the accuracy of segment predictions is limited by confusing information in video sequences, such as ambiguous frames at action boundaries or over-segmentation errors due to the lack of semantic relations. In this work, we present a two-stage refinement network (TSRN) to improve temporal action segmentation. We first capture global relations over an entire video sequence using a multi-head self-attention mechanism in a novel transformer temporal convolutional network and model temporal relations within each action segment. Then, we introduce a dual-attention spatial pyramid pooling network that fuses features from macroscale and microscale perspectives, refining the initial prediction into more accurate classification results. In addition, a joint loss function mitigates over-segmentation. Compared with state-of-the-art methods, the proposed TSRN substantially improves temporal action segmentation on three challenging datasets (50Salads, Georgia Tech Egocentric Activities, and Breakfast).
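The first stage described in the abstract combines multi-head self-attention over per-frame features with temporal convolution. The sketch below is a minimal, hypothetical PyTorch illustration of such a block; the layer sizes, head count, dilation, and residual layout are assumptions, not TSRN's published configuration.

```python
# A minimal sketch of the kind of block the abstract describes: multi-head
# self-attention over per-frame features (global frame-to-frame relations)
# followed by a dilated temporal convolution (local temporal structure).
# All hyperparameters here are illustrative assumptions.
import torch
import torch.nn as nn


class AttnTemporalConvBlock(nn.Module):
    def __init__(self, dim: int = 64, heads: int = 4, dilation: int = 1):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        # Dilated 1-D convolution over time; padding keeps the length fixed.
        self.tconv = nn.Conv1d(dim, dim, kernel_size=3,
                               padding=dilation, dilation=dilation)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, dim) per-frame features from a visual backbone.
        a, _ = self.attn(x, x, x)          # global relations across all frames
        x = self.norm(x + a)               # residual connection + normalization
        y = self.tconv(x.transpose(1, 2))  # (batch, dim, frames)
        return torch.relu(y).transpose(1, 2) + x


if __name__ == "__main__":
    frames = torch.randn(2, 300, 64)       # 2 clips, 300 frames, 64-d features
    out = AttnTemporalConvBlock()(frames)
    print(out.shape)                        # torch.Size([2, 300, 64])
```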
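The abstract also mentions a joint loss that mitigates over-segmentation without giving its form. A common formulation in temporal action segmentation is per-frame cross-entropy plus a truncated MSE smoothing term over neighbouring frames' log-probabilities (as in MS-TCN); the sketch below assumes that formulation, with illustrative weight and clamp values rather than TSRN's published settings.

```python
# A sketch of a joint loss of the kind the abstract alludes to:
# classification cross-entropy per frame plus a smoothing term that
# penalizes abrupt frame-to-frame changes in predicted log-probabilities,
# a common way to curb over-segmentation. Weights are assumptions.
import torch
import torch.nn.functional as F


def joint_loss(logits: torch.Tensor, labels: torch.Tensor,
               smooth_weight: float = 0.15, clamp: float = 16.0) -> torch.Tensor:
    # logits: (batch, classes, frames); labels: (batch, frames) with class ids.
    ce = F.cross_entropy(logits, labels)

    # Truncated MSE between log-probabilities of neighbouring frames.
    log_p = F.log_softmax(logits, dim=1)
    diff = (log_p[:, :, 1:] - log_p[:, :, :-1].detach()) ** 2
    smooth = torch.clamp(diff, max=clamp).mean()

    return ce + smooth_weight * smooth


if __name__ == "__main__":
    logits = torch.randn(2, 10, 300, requires_grad=True)   # 10 action classes
    labels = torch.randint(0, 10, (2, 300))
    print(joint_loss(logits, labels).item())
```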
ISSN: 1433-7541, 1433-755X
DOI: 10.1007/s10044-023-01166-8