KFC: An Efficient Framework for Semi-Supervised Temporal Action Localization

In temporal action localization (TAL), semi-supervised learning is a promising technique to mitigate the cost of precise boundary annotations. Semi-supervised approaches employing consistency regularization (CR), encouraging models to be robust to the perturbed inputs, have achieved great success in...

Full description

Saved in:

Bibliographic Details
Published in	IEEE transactions on image processing Vol. 30; pp. 6869 - 6878
Main Authors	Ding, Xinpeng, Wang, Nannan, Gao, Xinbo, Li, Jie, Wang, Xiaoyu, Liu, Tongliang
Format	Journal Article
Language	English
Published	New York IEEE 2021 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Annotations Crossovers Feature extraction Image classification Localization Location awareness Perturbation Perturbation methods Regularization Robustness Semantics semi-supervised learning Semisupervised learning Temporal action localization Training Video data video understanding
Online Access	Get full text

Cover

Loading…

More Information
Summary:	In temporal action localization (TAL), semi-supervised learning is a promising technique to mitigate the cost of precise boundary annotations. Semi-supervised approaches employing consistency regularization (CR), encouraging models to be robust to the perturbed inputs, have achieved great success in image classification problems. The success of CR is largely depended on the perturbations, where instances are perturbed to train a robust model without altering their semantic information. However, the perturbations for image or video classification tasks are not fit to apply to TAL. Since videos in TAL are too long to train the model with raw videos in an end-to-end manner. In this paper, we devise a method named K-farthest crossover to construct perturbations based on video features and apply it to TAL. Motivated by the observation that features in the same action instance become more and more similar during the training process while those in different action instances or backgrounds become more and more divergent, we add perturbations to each feature along temporal axis and adopt CR to encourage the model to retain this observation. Specifically, for a feature, we first find the top-k dissimilar features and average them to form a perturbation. Then, similar to chromosomal crossover, we select a large part of the feature and a small part of the perturbation to recombine a perturbed feature, which preserves the feature semantics yet enough discrepancy.
Bibliography:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23
ISSN:	1057-7149 1941-0042
DOI:	10.1109/TIP.2021.3099407