Deep cascaded action attention network for weakly-supervised temporal action localization

Bibliographic Details
Published in: Multimedia Tools and Applications, Vol. 82, No. 19, pp. 29769-29787
Main Authors: Xia, Hui-fen; Zhan, Yong-zhao
Format: Journal Article
Language: English
Published: New York: Springer US, 01.08.2023 (Springer Nature B.V.)

Summary: Weakly-supervised temporal action localization (W-TAL) aims to locate the boundaries of action instances in an untrimmed video and classify them, a challenging task because only video-level labels are available during training. Existing methods mainly focus on the most discriminative action snippets of a video through top-k multiple instance learning (MIL) and ignore both the less discriminative action snippets and the non-action snippets, which limits the gains in localization performance. To mine the less discriminative action snippets and better distinguish non-action snippets in a video, a novel method based on a deep cascaded action attention network is proposed. In this method, a deep cascaded action attention mechanism models not only the most discriminative action snippets but also different levels of less discriminative action snippets by introducing threshold erasing, which ensures the completeness of action instances. In addition, an entropy loss for non-action is introduced to restrict the activations of non-action snippets for all action categories; these activations are generated by aggregating the bottom-k activation scores along the temporal dimension. As a result, action snippets can be better distinguished from non-action snippets, which benefits the separation of action and non-action content and makes the detected action instances more accurate, ultimately facilitating more precise action localization. Extensive experiments on the THUMOS14 and ActivityNet1.3 datasets show that our method outperforms state-of-the-art methods at several t-IoU thresholds.
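
The abstract names three concrete mechanisms: top-k multiple instance learning to obtain video-level class scores, threshold erasing to drive a cascade of attention branches toward less discriminative snippets, and an entropy loss computed from bottom-k activation scores to suppress class-specific responses on non-action snippets. The PyTorch sketch below illustrates one plausible form of these ideas; the tensor shapes, function names, threshold value, and the uniform-distribution reading of the entropy term are assumptions for illustration, not the authors' implementation.

import torch
import torch.nn.functional as F

def topk_mil_scores(cas, k):
    # cas: (B, T, C) class activation sequence over T snippets.
    # Average the k highest snippet activations per class to obtain
    # video-level class scores for the MIL classification loss.
    topk_vals, _ = torch.topk(cas, k, dim=1)         # (B, k, C)
    return topk_vals.mean(dim=1)                     # (B, C)

def bottomk_entropy_loss(cas, k):
    # Aggregate the k lowest snippet activations per class and push the
    # resulting class distribution toward uniform, so presumed non-action
    # snippets do not favour any action category (assumed reading).
    bottomk_vals, _ = torch.topk(cas, k, dim=1, largest=False)  # (B, k, C)
    p = F.softmax(bottomk_vals.mean(dim=1), dim=-1)             # (B, C)
    entropy = -(p * torch.log(p + 1e-8)).sum(dim=-1)            # (B,)
    return -entropy.mean()  # minimizing this maximizes the entropy

def threshold_erase(features, attention, thresh=0.5):
    # Cascade step: suppress snippets whose attention already exceeds the
    # threshold so the next attention branch must rely on the remaining,
    # less discriminative snippets.
    keep = (attention < thresh).float().unsqueeze(-1)            # (B, T, 1)
    return features * keep

# Toy usage with B=2 videos, T=100 snippets, C=20 action classes.
cas = torch.randn(2, 100, 20)
video_scores = topk_mil_scores(cas, k=8)
background_reg = bottomk_entropy_loss(cas, k=8)

In a full training loop, video_scores would feed a standard video-level classification loss against the weak labels, while background_reg and the erased features from threshold_erase would shape the additional attention branches described in the abstract.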
ISSN: 1380-7501, 1573-7721
DOI: 10.1007/s11042-023-14670-0