Learning Contextually Fused Audio-Visual Representations For Audio-Visual Speech Recognition

Bibliographic Details
Published in: 2022 IEEE International Conference on Image Processing (ICIP), pp. 1346 - 1350
Main Authors: Zhang, Zi-Qiang; Zhang, Jie; Zhang, Jian-Shu; Wu, Ming-Hui; Fang, Xin; Dai, Li-Rong
Format: Conference Proceeding
Language: English
Published: IEEE, 16.10.2022
Summary: With the advances in self-supervised learning for the audio and visual modalities, it has become possible to learn robust audio-visual speech representations. This is beneficial for improving audio-visual speech recognition (AVSR) performance, since multi-modal inputs in principle carry richer information. In this paper, building on existing self-supervised representation learning methods for the audio modality, we propose an audio-visual representation learning approach. The proposed approach exploits both the complementarity of the audio and visual modalities and long-term context dependency, using a transformer-based fusion module and a flexible masking strategy. After pre-training, the model can extract the fused representations required by AVSR. Without loss of generality, it can also be applied to single-modal tasks, e.g., audio or visual speech recognition, by simply masking out one modality in the fusion module. The proposed pre-trained model is evaluated on speech recognition and lipreading tasks using one or both modalities, demonstrating its superiority.
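
To illustrate the kind of fusion the summary describes, below is a minimal PyTorch sketch of a transformer-based audio-visual fusion module with modality masking. It assumes frame-level audio and visual features have already been produced by pre-trained single-modal encoders; all module names, dimensions, and the concrete masking scheme are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch (not the authors' code) of transformer-based audio-visual
# fusion with modality masking. Feature dimensions and the masking scheme
# are assumptions for illustration only.
import torch
import torch.nn as nn


class AVFusion(nn.Module):
    def __init__(self, audio_dim=768, visual_dim=512, d_model=768,
                 nhead=8, num_layers=6):
        super().__init__()
        # Project each modality into a shared embedding space.
        self.audio_proj = nn.Linear(audio_dim, d_model)
        self.visual_proj = nn.Linear(visual_dim, d_model)
        # Learned modality embeddings so the encoder can tell the streams apart.
        self.modality_emb = nn.Embedding(2, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=nhead,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio, visual, mask_audio=False, mask_visual=False):
        # audio: (batch, T, audio_dim); visual: (batch, T, visual_dim),
        # assumed time-aligned at the same frame rate.
        a = self.audio_proj(audio) + self.modality_emb.weight[0]
        v = self.visual_proj(visual) + self.modality_emb.weight[1]
        # Concatenate the two streams along the time axis.
        x = torch.cat([a, v], dim=1)
        # Key-padding mask that hides a whole modality, so the same encoder
        # can be reused for audio-only or visual-only recognition.
        batch, t = audio.shape[0], audio.shape[1]
        pad = torch.zeros(batch, 2 * t, dtype=torch.bool, device=x.device)
        if mask_audio:
            pad[:, :t] = True
        if mask_visual:
            pad[:, t:] = True
        return self.encoder(x, src_key_padding_mask=pad)


# Usage: fused features for AVSR, or audio-only by masking out the video stream.
fusion = AVFusion()
audio = torch.randn(2, 50, 768)   # e.g. 50 frames of audio features
visual = torch.randn(2, 50, 512)  # aligned visual (lip) features
av_repr = fusion(audio, visual)                    # audio-visual
a_repr = fusion(audio, visual, mask_visual=True)   # audio-only
```

Masking a whole stream through the key-padding mask lets the same fused encoder serve single-modal recognition, mirroring the summary's point that one modality can simply be masked out in the fusion module.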
ISSN: 2381-8549
DOI: 10.1109/ICIP46576.2022.9897235