Integration of Global and Local Knowledge for Foreground Enhancing in Weakly Supervised Temporal Action Localization

Bibliographic Details
Published in IEEE Transactions on Multimedia Vol. 26; pp. 8476-8487
Main Authors Zhang, Tianyi; Li, Ronglu; Feng, Pengming; Zhang, Rubo
Format Journal Article
Language English
Published IEEE 2024
Summary: Weakly Supervised Temporal Action Localization (WTAL) aims to identify the temporal duration of actions and classify their categories using only video-level labels in the training stage. Motivated by the intuition that attention maps generated from different views help enhance the foreground action segments, in this paper we propose a WTAL pipeline based on a novel attention mechanism that effectively integrates global and local knowledge. Our attention mechanism is mainly composed of a global attention branch and a local attention branch. Specifically, the global attention branch is built on inter-segment similarity to sparsely mine correlation knowledge across the entire video, while the local attention branch is built on a convolutional structure to densely aggregate information within a fixed local receptive field. Experiments on the THUMOS14 and ActivityNet v1.3 datasets demonstrate the effectiveness of our proposed WTAL pipeline compared to state-of-the-art methods.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2024.3379887
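
The summary describes two attention branches: a global one that sparsely uses inter-segment similarity across the whole video, and a local one that densely aggregates information in a fixed temporal window. The sketch below is only an illustration of that idea under stated assumptions and is not the authors' implementation. The function names, the top-k sparsification, the fixed smoothing kernel (standing in for learned convolution weights), the use of feature norms as per-segment evidence, and the averaging fusion are all hypothetical choices not taken from the record.

```python
import math


def cosine(u, v):
    """Cosine similarity between two segment feature vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def global_attention(features, top_k=2):
    """Global branch (assumed form): each segment looks at the whole video but
    keeps only its top_k most similar segments, which makes the mining sparse."""
    scores = []
    for i, f in enumerate(features):
        sims = sorted(
            (cosine(f, g) for j, g in enumerate(features) if j != i),
            reverse=True,
        )
        kept = sims[:top_k]
        mean_sim = sum(kept) / len(kept) if kept else 0.0
        scores.append((mean_sim + 1.0) / 2.0)  # map [-1, 1] to [0, 1]
    return scores


def local_attention(features, kernel=(0.25, 0.5, 0.25)):
    """Local branch (assumed form): a 1D temporal filter with a fixed receptive
    field, applied densely to per-segment evidence (here the L2 feature norm).
    The fixed kernel is a placeholder for learned convolution weights."""
    energy = [math.sqrt(sum(a * a for a in f)) for f in features]
    half = len(kernel) // 2
    out = []
    for t in range(len(energy)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t + k - half
            if 0 <= idx < len(energy):  # zero padding at the video borders
                acc += w * energy[idx]
        out.append(acc)
    peak = max(out) if out else 0.0
    return [x / peak if peak > 0 else 0.0 for x in out]


def fuse(global_scores, local_scores):
    """Hypothetical fusion: average the two attention maps per segment."""
    return [(g + l) / 2.0 for g, l in zip(global_scores, local_scores)]


# Toy video: 3 segments with 2-dimensional features (made-up values).
video = [[0.0, 0.0], [3.0, 4.0], [0.0, 0.0]]
foreground_attention = fuse(global_attention(video), local_attention(video))
```

In this toy setting, a higher fused value marks a segment as more likely to be foreground. A trained model would learn both branches from video-level labels rather than use hand-set rules like these.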