Towards Completeness: A Generalizable Action Proposal Generator for Zero-Shot Temporal Action Localization
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 25.08.2024 |
Summary: | To address the zero-shot temporal action localization (ZSTAL) task, existing
works develop models that are generalizable to detect and classify actions from
unseen categories. They typically develop a category-agnostic action detector
and combine it with the Contrastive Language-Image Pre-training (CLIP) model to
solve ZSTAL. However, these methods suffer from incomplete action proposals
generated for unseen categories, since they follow a frame-level
prediction paradigm and require hand-crafted post-processing to generate action
proposals. To address this problem, we propose a novel model
named Generalizable Action Proposal generator (GAP), which can interface
seamlessly with CLIP and generate action proposals in a holistic way. Our GAP
is built in a query-based architecture and trained with a proposal-level
objective, enabling it to estimate proposal completeness and eliminate the
hand-crafted post-processing. Based on this architecture, we propose an
Action-aware Discrimination loss to enhance the category-agnostic dynamic
information of actions. In addition, we introduce a Static-Dynamic Rectifying
module that incorporates the generalizable static information from CLIP to
refine the predicted proposals, which improves proposal completeness in a
generalizable manner. Experiments show that GAP achieves state-of-the-art
performance on two challenging ZSTAL benchmarks, THUMOS14 and ActivityNet1.3,
improving average mAP over previous works by +3.2% and +3.4%, respectively. |
---|---|
DOI: | 10.48550/arxiv.2408.13777 |
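
The summary attributes incomplete proposals to the frame-level paradigm, where per-frame scores are thresholded and grouped by hand-crafted post-processing. The sketch below is only an illustration of that failure mode, not the authors' method or any baseline's actual code: the function names, the 0.5 threshold, and the toy scores are all assumptions made for the example. It shows how a single low-scoring frame splits one action into two fragments, and how temporal IoU (the overlap measure underlying mAP) penalizes each fragment.

```python
from typing import List, Tuple

# A segment is a half-open frame interval [start, end). Hypothetical type alias.
Segment = Tuple[int, int]


def frames_to_proposals(scores: List[float], threshold: float = 0.5) -> List[Segment]:
    """Group consecutive frames scoring >= threshold into segments.

    This stands in for generic hand-crafted post-processing; the threshold
    value is an assumption chosen for illustration.
    """
    proposals: List[Segment] = []
    start = None
    for i, s in enumerate(scores):
        if s >= threshold and start is None:
            start = i  # a new segment opens here
        elif s < threshold and start is not None:
            proposals.append((start, i))  # the open segment closes
            start = None
    if start is not None:
        proposals.append((start, len(scores)))
    return proposals


def temporal_iou(a: Segment, b: Segment) -> float:
    """Intersection over union of two temporal segments."""
    inter = max(0, min(a[1], b[1]) - max(a[0], b[0]))
    union = (a[1] - a[0]) + (b[1] - b[0]) - inter
    return inter / union if union > 0 else 0.0


# Toy per-frame actionness scores: one action spans frames 1..6,
# but frame 3 scores low (e.g. a brief pause in the motion).
scores = [0.1, 0.8, 0.9, 0.3, 0.7, 0.8, 0.9, 0.2]
ground_truth: Segment = (1, 7)

proposals = frames_to_proposals(scores)
best_iou = max(temporal_iou(p, ground_truth) for p in proposals)
print(proposals, best_iou)
```

In this toy case the action is split into two fragments and neither overlaps the ground truth by more than half, which is the kind of incompleteness a proposal-level objective is meant to avoid by scoring whole segments directly.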