Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition

Bibliographic Details
Published in: 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 642 - 647
Main Authors: Wang, Jinxin; Guo, Zhongwen; Yang, Chao; Li, Xiaomei; Cui, Ziyuan
Format: Conference Proceeding
Language: English
Published: IEEE, 01.07.2023
Summary: Compared to feature-level or decision-level fusion, hybrid fusion can further improve audio-visual speech recognition accuracy. Existing works mainly focus on designing the multi-modality feature extraction, interaction, and prediction processes, neglecting useful cross-modality information and the optimal combination of the different predicted results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. MSHF consists of a feature extraction subnetwork, which uses the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features, and a hybrid fusion subnetwork, which integrates the intrinsic correlations among the different modalities and optimizes the weights of the per-modality prediction results to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset; the results show that the proposed method outperforms the selected competitive baselines and the state of the art, indicating the superiority of the proposed modules.
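The hybrid-fusion idea in the abstract (combining per-modality predictions with optimized weights, alongside a joint-feature branch) can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function names, the softmax weight normalization, and the three-branch layout are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_fusion(audio_logits, visual_logits, joint_logits, branch_weights):
    """Hypothetical decision-level stage of a hybrid fusion scheme.

    Each branch (audio-only, visual-only, and a joint feature-fusion
    branch) produces class logits; learned scalar weights, here
    softmax-normalized to sum to 1, blend their probabilities.
    In a real system the weights would be trained parameters.
    """
    w = softmax(np.asarray(branch_weights, dtype=float))
    branch_probs = [softmax(l) for l in (audio_logits, visual_logits, joint_logits)]
    return sum(wi * p for wi, p in zip(w, branch_probs))

# Toy usage: two classes, equal (zero) raw branch weights.
fused = hybrid_fusion(
    np.array([2.0, 0.0]),   # audio branch logits
    np.array([0.0, 1.0]),   # visual branch logits
    np.array([1.0, 1.0]),   # joint-feature branch logits
    [0.0, 0.0, 0.0],        # raw weights -> uniform 1/3 each
)
```

The design point this sketches is why the abstract stresses the "optimal combination of different predicted results": when one modality is unreliable (e.g. noisy audio), a learned weighting can shift mass toward the other branches rather than averaging them blindly.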
ISSN: 1945-788X
DOI: 10.1109/ICME55011.2023.00116