KFC: An Efficient Framework for Semi-Supervised Temporal Action Localization

In temporal action localization (TAL), semi-supervised learning is a promising technique to mitigate the cost of precise boundary annotations. Semi-supervised approaches employing consistency regularization (CR), encouraging models to be robust to the perturbed inputs, have achieved great success in...

Full description

Saved in:
Bibliographic Details
Published inIEEE transactions on image processing Vol. 30; pp. 6869 - 6878
Main Authors Ding, Xinpeng, Wang, Nannan, Gao, Xinbo, Li, Jie, Wang, Xiaoyu, Liu, Tongliang
Format Journal Article
LanguageEnglish
Published New York IEEE 2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects
Online AccessGet full text

Cover

Loading…
More Information
Summary:In temporal action localization (TAL), semi-supervised learning is a promising technique to mitigate the cost of precise boundary annotations. Semi-supervised approaches employing consistency regularization (CR), encouraging models to be robust to the perturbed inputs, have achieved great success in image classification problems. The success of CR is largely depended on the perturbations, where instances are perturbed to train a robust model without altering their semantic information. However, the perturbations for image or video classification tasks are not fit to apply to TAL. Since videos in TAL are too long to train the model with raw videos in an end-to-end manner. In this paper, we devise a method named K-farthest crossover to construct perturbations based on video features and apply it to TAL. Motivated by the observation that features in the same action instance become more and more similar during the training process while those in different action instances or backgrounds become more and more divergent, we add perturbations to each feature along temporal axis and adopt CR to encourage the model to retain this observation. Specifically, for a feature, we first find the top-k dissimilar features and average them to form a perturbation. Then, similar to chromosomal crossover, we select a large part of the feature and a small part of the perturbation to recombine a perturbed feature, which preserves the feature semantics yet enough discrepancy.
Bibliography:ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
ISSN:1057-7149
1941-0042
DOI:10.1109/TIP.2021.3099407