Integration of Global and Local Knowledge for Foreground Enhancing in Weakly Supervised Temporal Action Localization

Bibliographic Details
Published in	IEEE Transactions on Multimedia, Vol. 26, pp. 8476-8487
Main Authors	Zhang, Tianyi; Li, Ronglu; Feng, Pengming; Zhang, Rubo
Format	Journal Article
Language	English
Published	IEEE 2024
Subjects	Annotations; Convolution; Feature extraction; Location awareness; Pipelines; Task analysis; temporal action localization; Training; video content analysis; Weakly supervised learning
Summary:	Weakly Supervised Temporal Action Localization (WTAL) aims to identify the temporal duration of actions and classify the action categories with only video-level labels in the training stage. Motivated by the intuition that the attention maps generated from various views will assist in enhancing the foreground action temporal segments, in this paper we propose a WTAL pipeline based on a novel attention mechanism that effectively integrates global and local knowledge. Our attention mechanism is mainly composed of a global attention branch and a local attention branch. Specifically, the global attention branch is built on the inter-segment similarity to sparsely mine out the correlation knowledge within the entire video, while the local attention branch is built on the convolutional structure to densely aggregate the information within the fixed local receptive field. Experiments on the THUMOS14 and ActivityNet v1.3 datasets demonstrate the effectiveness of our proposed WTAL pipeline compared to state-of-the-art methods.
ISSN:	1520-9210, 1941-0077
DOI:	10.1109/TMM.2024.3379887
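
The summary describes two attention branches: a global one built on inter-segment similarity and a local one built on a convolution over a fixed temporal window. The NumPy sketch below only illustrates that general idea and is not the authors' implementation. The function names, the top-k sparsification, the scoring vector `w`, the filter `kernel`, and the averaging fusion are all assumptions introduced here for illustration; the record does not specify any of them.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def global_attention(feats, w, top_k=3):
    """Hypothetical global branch.

    feats: (T, D) segment features; w: (D,) assumed scoring vector.
    Returns one attention value in (0, 1) per segment.
    """
    # Cosine similarity between every pair of segments in the video.
    normed = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    sim = normed @ normed.T  # (T, T)

    # Assumed sparsification: keep the top_k most similar segments per row
    # (ties at the threshold may keep a few more).
    k = min(top_k, sim.shape[1])
    kth = np.sort(sim, axis=1)[:, -k][:, None]
    masked = np.where(sim >= kth, sim, -np.inf)

    # Softmax over the kept entries only; masked entries get weight 0.
    masked = masked - masked.max(axis=1, keepdims=True)
    weights = np.exp(masked)
    weights /= weights.sum(axis=1, keepdims=True)

    context = weights @ feats  # (T, D) similarity-weighted aggregation
    return sigmoid(context @ w)


def local_attention(feats, kernel, bias=0.0):
    """Hypothetical local branch.

    kernel: (K, D) assumed temporal filter with odd K, so each segment
    sees a fixed window of K neighbouring segments (zero-padded ends).
    """
    K = kernel.shape[0]
    pad = K // 2
    padded = np.pad(feats, ((pad, pad), (0, 0)))
    T = feats.shape[0]
    scores = np.array([np.sum(padded[t:t + K] * kernel) for t in range(T)])
    return sigmoid(scores + bias)


def fuse(global_att, local_att):
    # Assumed fusion rule: plain average of the two attention maps.
    return 0.5 * (global_att + local_att)
```

In this sketch each branch yields a length-T attention map over the video segments, and the fused map would then weight segments as foreground. How the published method scores, sparsifies, and combines the branches should be taken from the paper itself.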