DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion
Published in | Applied intelligence (Dordrecht, Netherlands) Vol. 55; no. 3; p. 224 |
Main Authors | Wu, Jinghan; Zhang, Yakun; Zhang, Meishan; Zheng, Changyan; Zhang, Xingyu; Xie, Liang; An, Xingwei; Yin, Erwei |
Format | Journal Article |
Language | English |
Published | Boston: Springer Nature B.V, 01.02.2025 |
Subjects | Audio data; Audio signals; Speech; Speech recognition; Voice recognition |
Abstract | Speech recognition is a major communication channel for human-machine interaction and has seen outstanding breakthroughs. However, the practicality of single-modal speech recognition is not satisfactory in high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to pay excessive attention to the alignment of semantic features and the construction of fused features between modalities, neglecting the preservation of single-modal characteristics. In this work, audio signals, visual cues from lip region images, and facial electromyography signals are used for unrestricted speech recognition, which can effectively resist the noise interference affecting individual modalities. To preserve the unique feature expression of each speech modality and improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework (dubbed DuAGNet) is proposed, utilizing modality-specific and feature-specific adaptive gating networks. A multimodal speech dataset is constructed from forty subjects to validate the effectiveness of the proposed DuAGNet, covering three modalities of speech data and 100 classes of Chinese phrases. Both the highest recognition accuracy of 98.79% and the lowest standard deviation of 0.83 are obtained on clean test data, and a maximum accuracy increase of over 80% is achieved compared to audio-only speech recognition systems when severe audio noise is introduced. |
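The abstract describes fusion at two levels of adaptive gating: modality-specific gates that reweight whole modalities and feature-specific gates that reweight individual dimensions of the fused feature. The record does not include the authors' implementation, so the following is a minimal PyTorch sketch of that dual-gating idea only; the module name, embedding sizes, and all design details are illustrative assumptions, not the published DuAGNet architecture.

```python
# Minimal sketch of dual adaptive gating fusion as described in the abstract.
# All names, dimensions, and layer choices are assumptions for illustration;
# this is not the authors' implementation.
import torch
import torch.nn as nn

class DualAdaptiveGatingFusion(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 100):
        super().__init__()
        # Modality-specific gating: one weight per modality, conditioned on
        # all three inputs, so a noisy modality can be adaptively suppressed.
        self.modality_gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        # Feature-specific gating: one sigmoid weight per fused-feature dimension.
        self.feature_gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(3 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio, lip, emg):
        # audio, lip, emg: (batch, dim) embeddings from per-modality encoders
        concat = torch.cat([audio, lip, emg], dim=-1)          # (batch, 3*dim)
        w = self.modality_gate(concat)                          # (batch, 3)
        gated = torch.cat(
            [w[:, 0:1] * audio, w[:, 1:2] * lip, w[:, 2:3] * emg], dim=-1
        )
        fused = self.proj(gated)                                # (batch, dim)
        fused = self.feature_gate(concat) * fused               # per-dimension gate
        return self.classifier(fused)                           # (batch, num_classes)

# Usage with random stand-in embeddings:
model = DualAdaptiveGatingFusion()
a, l, e = (torch.randn(4, 256) for _ in range(3))
logits = model(a, l, e)  # torch.Size([4, 100])
```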
ArticleNumber | 224 |
Author | Wu, Jinghan; Zhang, Yakun; Zhang, Meishan; Zheng, Changyan; Zhang, Xingyu; Xie, Liang; An, Xingwei; Yin, Erwei |
ContentType | Journal Article |
Copyright | Copyright Springer Nature B.V. Feb 2025 |
DOI | 10.1007/s10489-024-06119-0 |
Discipline | Computer Science |
EISSN | 1573-7497 |
ISSN | 0924-669X |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 3 |
Language | English |
ORCID | 0000-0002-2147-9888 |
PublicationDate | 2025-02-01
PublicationPlace | Boston |
PublicationTitle | Applied intelligence (Dordrecht, Netherlands) |
PublicationYear | 2025 |
Publisher | Springer Nature B.V |
StartPage | 224 |
SubjectTerms | Audio data; Audio signals; Speech; Speech recognition; Voice recognition |
Title | DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion |
URI | https://www.proquest.com/docview/3149314017 |
Volume | 55 |