Perfect Match: Improved Cross-modal Embeddings for Audio-visual Synchronisation

Bibliographic Details
Published in: ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 3965 - 3969
Main Authors: Chung, Soo-Whan; Chung, Joon Son; Kang, Hong-Goo
Format: Conference Proceeding
Language: English
Published: IEEE, 01.05.2019

Summary: This paper proposes a new strategy for learning powerful cross-modal embeddings for audio-to-video synchronisation. Here, we set up the problem as one of cross-modal retrieval, where the objective is to find the most relevant audio segment given a short video clip. The method builds on the recent advances in learning representations from cross-modal self-supervision. The main contributions of this paper are as follows: (1) we propose a new learning strategy where the embeddings are learnt via a multi-way matching problem, as opposed to a binary classification (matching or non-matching) problem as proposed by recent papers; (2) we demonstrate that the performance of this method far exceeds the existing baselines on the synchronisation task; (3) we use the learnt embeddings for visual speech recognition in self-supervision, and show that the performance matches that of representations learnt end-to-end in a fully-supervised manner.
ISSN: 2379-190X
DOI: 10.1109/ICASSP.2019.8682524
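
The summary above describes contribution (1) as recasting synchronisation from a binary match/non-match decision into a multi-way matching (N-way classification) problem. The following is a minimal sketch of what such a loss could look like, assuming L2 distances between a video embedding and N candidate audio embeddings (one synchronised, N-1 temporally shifted); the function name, tensor shapes, and the convention that index 0 is the matching segment are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of a multi-way matching loss for audio-visual
# synchronisation. Shapes and naming are assumptions for illustration.
import torch
import torch.nn.functional as F


def multiway_matching_loss(video_feat: torch.Tensor,
                           audio_feats: torch.Tensor) -> torch.Tensor:
    """Treat synchronisation as an N-way classification problem.

    video_feat:  (B, D)    embedding of one short video clip per example.
    audio_feats: (B, N, D) N candidate audio segments per example; index 0
                           is assumed to be the synchronised segment, the
                           remaining N-1 are temporally shifted ones.
    """
    # Squared Euclidean distance between the video embedding and each
    # candidate audio embedding: shape (B, N).
    dists = ((video_feat.unsqueeze(1) - audio_feats) ** 2).sum(dim=-1)

    # Smaller distance should mean "more likely the matching segment",
    # so the logits are the negative distances.
    logits = -dists

    # Cross-entropy against the index of the synchronised segment (0 here).
    target = torch.zeros(video_feat.size(0), dtype=torch.long,
                         device=video_feat.device)
    return F.cross_entropy(logits, target)


if __name__ == "__main__":
    # Toy check with random features: batch of 4, 16 candidate offsets,
    # 512-dimensional embeddings.
    v = torch.randn(4, 512)
    a = torch.randn(4, 16, 512)
    print(multiway_matching_loss(v, a))
```

Compared with a binary matching loss applied to one positive and one negative pair at a time, this formulation lets every candidate offset compete in a single softmax, which is the intuition behind the multi-way strategy the summary credits for the improved synchronisation results.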