Self-Supervised Contrastive Learning for Audio-Visual Action Recognition

Bibliographic Details
Published in: 2023 IEEE International Conference on Image Processing (ICIP), pp. 1000 - 1004
Main Authors: Liu, Yang; Tan, Ying; Lan, Haoyuan
Format: Conference Proceeding
Language: English
Published: IEEE, 08.10.2023
DOI: 10.1109/ICIP49359.2023.10222383

Summary: The underlying correlation between audio and visual modalities can be utilized to learn supervised information for unlabeled videos. In this paper, we propose an end-to-end self-supervised framework named Audio-Visual Contrastive Learning (AVCL) to learn discriminative audio-visual representations for action recognition. Specifically, we design an attention-based multi-modal fusion module (AMFM) to fuse audio and visual modalities. To align heterogeneous audio-visual modalities, we construct a novel co-correlation guided representation alignment module (CGRA). To learn supervised information from unlabeled videos, we propose a novel self-supervised contrastive learning module (SelfCL). Furthermore, we build a new audio-visual action recognition dataset named Kinetics-Sounds100. Experimental results on the Kinetics-Sounds32 and Kinetics-Sounds100 datasets demonstrate the superiority of AVCL over state-of-the-art methods on large-scale action recognition benchmarks.
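The abstract does not give the exact formulation of the SelfCL module, but it describes contrastive learning over unlabeled audio-visual pairs. The sketch below shows a generic InfoNCE-style audio-visual contrastive loss under the assumption that the audio and visual embeddings of the same clip form the positive pair and other clips in the batch serve as negatives; the function name, tensor shapes, and temperature are illustrative and not taken from the paper, and the AMFM and CGRA modules are omitted.

```python
# Minimal sketch (not the authors' code) of audio-visual contrastive learning:
# paired audio/visual embeddings are positives, other clips in the batch are
# negatives (symmetric InfoNCE). Encoders and hyperparameters are assumptions.
import torch
import torch.nn.functional as F

def av_contrastive_loss(audio_emb, visual_emb, temperature=0.07):
    """InfoNCE loss over a batch of paired audio/visual embeddings.

    audio_emb, visual_emb: tensors of shape (batch, dim) produced by the
    audio and visual encoders for the same set of video clips.
    """
    # L2-normalize so dot products are cosine similarities.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(visual_emb, dim=-1)

    # Similarity matrix: entry (i, j) compares audio clip i with visual
    # clip j; the diagonal holds the positive (same-clip) pairs.
    logits = a @ v.t() / temperature
    targets = torch.arange(a.size(0), device=a.device)

    # Symmetric cross-entropy: audio-to-visual and visual-to-audio.
    loss_a2v = F.cross_entropy(logits, targets)
    loss_v2a = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_a2v + loss_v2a)

# Example usage with random features standing in for encoder outputs.
audio_feat = torch.randn(8, 256)
visual_feat = torch.randn(8, 256)
print(av_contrastive_loss(audio_feat, visual_feat).item())
```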