Interaction Compass: Multi-Label Zero-Shot Learning of Human-Object Interactions via Spatial Relations

Bibliographic Details
Published in: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8452-8463
Main Authors: Huynh, Dat; Elhamifar, Ehsan
Format: Conference Proceeding
Language: English
Published: IEEE, 01.10.2021

Summary: We study the problem of multi-label zero-shot recognition in which labels are human-object interactions (combinations of actions on objects), each image may contain multiple interactions, and some interactions have no training images. We propose a novel compositional learning framework that decouples interaction labels into separate action and object scores that incorporate the spatial compatibility between the two components. We combine these scores to efficiently recognize seen and unseen interactions. However, learning action-object spatial relations in principle requires bounding-box annotations, which are costly to gather. Moreover, it is not clear how to generalize spatial relations to unseen interactions. We address these challenges by developing a cross-attention mechanism that localizes objects from action locations and vice versa by predicting displacements between them, referred to as relational directions. During training, we estimate the relational directions as the ones that maximize the scores of ground-truth interactions, which guides predictions toward compatible action-object regions. Through extensive experiments, we show the effectiveness of our framework, improving the state of the art by 2.6% mAP on HICO and 5.8% recall on Visual Genome.
ISSN: 2380-7504
DOI: 10.1109/ICCV48922.2021.00836
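
The abstract describes a compositional scoring scheme: an interaction score is assembled from an action score, an object score, and a predicted displacement (the "relational direction") that ties the two regions together. Below is a minimal, hypothetical PyTorch sketch of that idea under assumed shapes and layer choices; it is not the authors' implementation, and all names (CompositionalInteractionScorer, action_head, object_head, direction_head) are made up for illustration.

import torch
import torch.nn as nn
import torch.nn.functional as F

class CompositionalInteractionScorer(nn.Module):
    """Hypothetical sketch: score one (action, object) pair from a feature map."""

    def __init__(self, feat_dim, num_actions, num_objects):
        super().__init__()
        self.action_head = nn.Conv2d(feat_dim, num_actions, kernel_size=1)
        self.object_head = nn.Conv2d(feat_dim, num_objects, kernel_size=1)
        # Per-action 2-D displacement (dy, dx) at every location: the assumed
        # "relational direction" pointing from action regions toward the object.
        self.direction_head = nn.Conv2d(feat_dim, num_actions * 2, kernel_size=1)

    def forward(self, feats, action_idx, object_idx):
        # feats: (B, C, H, W) backbone features; returns one score per image.
        B, _, H, W = feats.shape
        action_map = self.action_head(feats)[:, action_idx]                          # (B, H, W)
        object_map = self.object_head(feats)[:, object_idx]                          # (B, H, W)
        direction = self.direction_head(feats).view(B, -1, 2, H, W)[:, action_idx]   # (B, 2, H, W)

        # Attention over spatial locations, driven by the action score map.
        attn = torch.softmax(action_map.flatten(1), dim=-1).view(B, H, W)

        # Expected action location and expected displacement under that attention.
        ys, xs = torch.meshgrid(
            torch.arange(H, dtype=feats.dtype, device=feats.device),
            torch.arange(W, dtype=feats.dtype, device=feats.device),
            indexing="ij",
        )
        act_y = (attn * ys).sum(dim=(1, 2))
        act_x = (attn * xs).sum(dim=(1, 2))
        dy = (attn * direction[:, 0]).sum(dim=(1, 2))
        dx = (attn * direction[:, 1]).sum(dim=(1, 2))

        # Read the object score at the location the action "points to";
        # grid_sample expects coordinates normalized to [-1, 1] in (x, y) order.
        norm_y = ((act_y + dy) / max(H - 1, 1)) * 2 - 1
        norm_x = ((act_x + dx) / max(W - 1, 1)) * 2 - 1
        grid = torch.stack([norm_x, norm_y], dim=-1).view(B, 1, 1, 2)
        obj_score = F.grid_sample(object_map.unsqueeze(1), grid, align_corners=True).view(B)

        # Compositional interaction score: action evidence plus spatially compatible object evidence.
        act_score = (attn * action_map).sum(dim=(1, 2))
        return act_score + obj_score

In this sketch, scorer(feats, action_idx, object_idx) returns one score per image for the chosen action-object pair, so seen and unseen interactions differ only in which (action, object) indices are queried; the decoupled heads are what make the zero-shot composition possible.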