Listen to Look: Action Recognition by Previewing Audio

Bibliographic Details
Published in	2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 10454-10464
Main Authors	Gao, Ruohan; Oh, Tae-Hyun; Grauman, Kristen; Torresani, Lorenzo
Format	Conference Proceeding
Language	English
Published	IEEE 01.01.2020
Subjects	Buildings; Image recognition; Image segmentation; Proposals; Redundancy; Spatiotemporal phenomena; Visualization

Summary:	In the face of the video data deluge, today's expensive clip-level classifiers are increasingly impractical. We propose a framework for efficient action recognition in untrimmed video that uses audio as a preview mechanism to eliminate both short-term and long-term visual redundancies. First, we devise an ImgAud2Vid framework that hallucinates clip-level features by distilling from lighter modalities (a single frame and its accompanying audio), reducing short-term temporal redundancy for efficient clip-level recognition. Second, building on ImgAud2Vid, we further propose ImgAud-Skimming, an attention-based long short-term memory network that iteratively selects useful moments in untrimmed videos, reducing long-term temporal redundancy for efficient video-level recognition. Extensive experiments on four action recognition datasets demonstrate that our method achieves state-of-the-art results in terms of both recognition accuracy and speed.
ISSN:	2575-7075
DOI:	10.1109/CVPR42600.2020.01047
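
Illustration:	The two ideas named in the summary, distilling a heavy clip-level feature into a lighter image-audio feature and using attention to pick useful moments, can be sketched roughly as below. This is a minimal sketch and not the authors' implementation: the mean-squared-error distillation loss, the dot-product attention scoring, and the names `distillation_loss` and `attend` are all assumptions made for illustration. The LSTM that would produce the query vector at each step is omitted.

```python
import math


def distillation_loss(student_feat, teacher_feat):
    """Mean squared error between a student (image + audio) feature and a
    teacher (full video clip) feature. MSE is an assumed choice of loss."""
    if len(student_feat) != len(teacher_feat):
        raise ValueError("feature vectors must have the same length")
    squared = [(s - t) ** 2 for s, t in zip(student_feat, teacher_feat)]
    return sum(squared) / len(squared)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, features):
    """Score every moment against a query with a dot product (an assumed
    scoring rule) and return the attention weights plus the index of the
    highest-weighted moment."""
    scores = [sum(q * f for q, f in zip(query, feat)) for feat in features]
    weights = softmax(scores)
    best = max(range(len(weights)), key=weights.__getitem__)
    return weights, best


# Hypothetical toy data: three moments with 2-dimensional features.
moments = [[0.0, 1.0], [2.0, 0.0], [1.0, 1.0]]
weights, best = attend([1.0, 0.0], moments)
loss = distillation_loss([1.0, 2.0], [1.0, 4.0])
```

In a skimming loop of this kind, the selected index would indicate which moment to process next, and the query would be refreshed from the recurrent state after each selection.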