Multi-Scale Hybrid Fusion Network for Mandarin Audio-Visual Speech Recognition
Compared to feature or decision fusion, hybrid fusion can beneficially improve audio-visual speech recognition accuracy. Existing works are mainly prone to design the multi-modality feature extraction process, interaction, and prediction, neglecting useful information on the multi-modality and the o...
Saved in:
Published in | 2023 IEEE International Conference on Multimedia and Expo (ICME) pp. 642 - 647 |
---|---|
Main Authors | , , , , |
Format | Conference Proceeding |
Language | English |
Published |
IEEE
01.07.2023
|
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Compared to feature or decision fusion, hybrid fusion can beneficially improve audio-visual speech recognition accuracy. Existing works are mainly prone to design the multi-modality feature extraction process, interaction, and prediction, neglecting useful information on the multi-modality and the optimal combination of different predicted results. In this paper, we propose a multi-scale hybrid fusion network (MSHF) for mandarin audio-visual speech recognition. Our MSHF consists of a feature extraction subnetwork to exploit the proposed multi-scale feature extraction module (MSFE) to obtain multi-scale features and a hybrid fusion subnetwork to integrate the intrinsic correlation of different modality information, optimizing the weights of prediction results for different modalities to achieve the best classification. We further design a feature recognition module (FRM) for accurate audio-visual speech recognition. We conducted experiments on the CAS-VSR-W1k dataset. The experimental results show that the proposed method outperforms the selected competitive baselines and the state-of-the-art, indicating the superiority of our proposed modules. |
---|---|
ISSN: | 1945-788X |
DOI: | 10.1109/ICME55011.2023.00116 |