Cross-Modal Contrastive Learning Network for Few-Shot Action Recognition

Bibliographic Details
Published in: IEEE Transactions on Image Processing, Vol. 33, pp. 1257-1271
Main Authors: Wang, Xiao; Yan, Yan; Hu, Hai-Miao; Li, Bo; Wang, Hanzi
Format: Journal Article
Language: English
Published: United States: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 2024
Summary: Few-shot action recognition aims to recognize new, unseen categories from only a few labeled samples of each class. However, it still suffers from inadequate data, which easily leads to overfitting and poor generalization. Therefore, we propose a cross-modal contrastive learning network (CCLN), consisting of an adversarial branch and a contrastive branch, to perform effective few-shot action recognition. In the adversarial branch, we design a prototypical generative adversarial network (PGAN) that synthesizes additional training samples, mitigating the data scarcity problem and thereby alleviating overfitting. When training samples are limited, the obtained visual features are usually suboptimal for video understanding because they lack discriminative information. To address this issue, in the contrastive branch, we propose a cross-modal contrastive learning module (CCLM) that uses semantic information to obtain discriminative feature representations, enabling the network to enhance its feature learning ability at the class level. Moreover, since videos contain crucial sequence and ordering information, we introduce a spatial-temporal enhancement module (SEM) to model the spatial context within video frames and the temporal context across video frames. Experimental results show that the proposed CCLN outperforms state-of-the-art few-shot action recognition methods on four challenging benchmarks: Kinetics, UCF101, HMDB51, and SSv2.
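The abstract does not give the exact formulation of the CCLM objective, so the following is a minimal sketch, assuming a standard InfoNCE-style class-level contrastive loss between visual prototypes and semantic (label-text) embeddings, written in PyTorch; the function name, tensor shapes, and temperature value are illustrative assumptions, not the authors' code.

    import torch
    import torch.nn.functional as F

    def cross_modal_contrastive_loss(visual_protos, semantic_embeds, temperature=0.07):
        # visual_protos:   (C, D) class prototypes averaged from support videos
        # semantic_embeds: (C, D) embeddings of the class names/descriptions
        # Matching rows form positive pairs; all other pairs act as negatives.
        v = F.normalize(visual_protos, dim=-1)
        s = F.normalize(semantic_embeds, dim=-1)
        logits = v @ s.t() / temperature  # (C, C) scaled cosine similarities
        targets = torch.arange(v.size(0), device=v.device)
        # Symmetric InfoNCE over both directions: visual-to-semantic and
        # semantic-to-visual.
        return 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

For a 5-way episode with 512-dimensional features, this loss pulls each class prototype toward its own class-name embedding and pushes it away from the other four, which matches the class-level discriminability the abstract attributes to the CCLM.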
ISSN: 1057-7149, 1941-0042
DOI: 10.1109/TIP.2024.3354104