Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition

Bibliographic Details
Published in: 2023 IEEE International Conference on Multimedia and Expo (ICME), pp. 642 - 647
Main Authors: Wang, Jinxin; Guo, Zhongwen; Yang, Chao; Li, Xiaomei; Cui, Ziyuan
Format: Conference Proceeding
Language: English
Published: IEEE, 01.07.2023
Summary: Compared to feature-level or decision-level fusion, hybrid fusion can further improve audio-visual speech recognition accuracy. Existing works mainly focus on designing the multi-modality feature extraction, interaction, and prediction processes, neglecting useful cross-modality information and the optimal combination of the different predicted results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for Mandarin audio-visual speech recognition. MSHF consists of a feature extraction subnetwork, which uses the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features, and a hybrid fusion subnetwork, which integrates the intrinsic correlations among the different modalities and optimizes the weights of the per-modality prediction results to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset; the results show that the proposed method outperforms the selected competitive baselines and the state of the art, indicating the superiority of the proposed modules.
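The hybrid-fusion idea in the abstract (combining per-modality predictions with optimized weights, alongside a joint-feature branch) can be illustrated with a minimal sketch. This is not the paper's actual implementation; the function names, the softmax weight normalization, and the three-branch layout are illustrative assumptions:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def hybrid_fusion(audio_logits, visual_logits, joint_logits, branch_weights):
    """Hypothetical decision-level stage of a hybrid fusion scheme.

    Each branch (audio-only, visual-only, and a joint feature-fusion
    branch) produces class logits; learned scalar weights, here
    softmax-normalized to sum to 1, blend their probabilities.
    In a real system the weights would be trained parameters.
    """
    w = softmax(np.asarray(branch_weights, dtype=float))
    branch_probs = [softmax(l) for l in (audio_logits, visual_logits, joint_logits)]
    return sum(wi * p for wi, p in zip(w, branch_probs))

# Toy usage: two classes, equal (zero) raw branch weights.
fused = hybrid_fusion(
    np.array([2.0, 0.0]),   # audio branch logits
    np.array([0.0, 1.0]),   # visual branch logits
    np.array([1.0, 1.0]),   # joint-feature branch logits
    [0.0, 0.0, 0.0],        # raw weights -> uniform 1/3 each
)
```

The design point this sketches is why the abstract stresses the "optimal combination of different predicted results": when one modality is unreliable (e.g. noisy audio), a learned weighting can shift mass toward the other branches rather than averaging them blindly.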
ISSN: 1945-788X
DOI: 10.1109/ICME55011.2023.00116