DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion
Published in | Applied Intelligence (Dordrecht, Netherlands), Vol. 55, No. 3, p. 224
Main Authors |
Format | Journal Article
Language | English
Published | Boston: Springer Nature B.V., 01.02.2025
Subjects |
Summary: | Speech recognition is a major communication channel for human-machine interaction and has seen outstanding breakthroughs. However, single-modal speech recognition remains unsatisfactory in high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to pay excessive attention to aligning semantic features and constructing fused features across modalities, neglecting the preservation of single-modal characteristics. In this work, audio signals, visual cues from lip-region images, and facial electromyography signals are used for unrestricted speech recognition, which can effectively resist the noise interference affecting individual modalities. To preserve the unique feature expression of each speech modality and to improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework (dubbed DuAGNet) is proposed, utilizing modality-specific and feature-specific adaptive gating networks. A multimodal speech dataset covering the three speech modalities and 100 classes of Chinese phrases is constructed from forty subjects to validate the effectiveness of the proposed DuAGNet. On clean test data, DuAGNet attains both the highest recognition accuracy (98.79%) and the lowest standard deviation (0.83); when severe audio noise is introduced, it achieves a maximum accuracy improvement of over 80% compared to audio-only speech recognition systems.
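The abstract describes the fusion design only at a high level: a modality-specific gating stage that weights whole modalities, followed by a feature-specific gating stage that re-weights individual feature dimensions. The sketch below is a minimal, hypothetical PyTorch rendering of such a two-stage adaptive gating fusion; the module names, dimensions, and exact gating formulation are assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class DualAdaptiveGatingFusion(nn.Module):
    """Hypothetical sketch of dual adaptive gating over three speech modalities.

    Stage 1 (modality-specific): a scalar gate per modality weights each
    modality's embedding, preserving single-modal characteristics.
    Stage 2 (feature-specific): an element-wise gate re-weights individual
    dimensions of the fused representation.
    """

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        # One scalar gate per modality, conditioned on all modality embeddings.
        self.modality_gate = nn.Linear(num_modalities * dim, num_modalities)
        # Element-wise gate over the fused feature vector.
        self.feature_gate = nn.Linear(dim, dim)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # feats: (batch, dim) embeddings for audio, lip images, and facial EMG.
        concat = torch.cat(feats, dim=-1)                        # (batch, 3*dim)
        m_gates = torch.softmax(self.modality_gate(concat), -1)  # (batch, 3)
        stacked = torch.stack(feats, dim=1)                      # (batch, 3, dim)
        fused = (m_gates.unsqueeze(-1) * stacked).sum(dim=1)     # (batch, dim)
        f_gates = torch.sigmoid(self.feature_gate(fused))        # (batch, dim)
        return f_gates * fused                                   # gated fusion


# Usage with three hypothetical 256-d modality encoder outputs:
if __name__ == "__main__":
    fusion = DualAdaptiveGatingFusion(dim=256)
    audio, lip, emg = (torch.randn(8, 256) for _ in range(3))
    out = fusion([audio, lip, emg])
    print(out.shape)  # torch.Size([8, 256])
```

Under these assumptions, the softmax gate lets the network down-weight a noise-corrupted modality (e.g., audio under severe noise) while the sigmoid gate selectively passes the fused feature dimensions, which is one plausible way to realize the noise robustness the abstract reports.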
ISSN: | 0924-669X (print), 1573-7497 (electronic)
DOI: | 10.1007/s10489-024-06119-0