DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion
Published in | Applied intelligence (Dordrecht, Netherlands) Vol. 55; no. 3; p. 224 |
Main Authors | Wu, Jinghan; Zhang, Yakun; Zhang, Meishan; Zheng, Changyan; Zhang, Xingyu; Xie, Liang; An, Xingwei; Yin, Erwei |
Format | Journal Article |
Language | English |
Published | Boston: Springer Nature B.V, 01.02.2025 |
Subjects | Audio data; Audio signals; Speech; Speech recognition; Voice recognition |
Abstract | Speech recognition is a major communication channel for human-machine interaction and has seen outstanding breakthroughs. However, the practicality of single-modal speech recognition is not satisfactory in high-noise or silent communication applications. Integrating multiple modalities can effectively address this problem, but existing fusion methods tend to pay excessive attention to the alignment of semantic features and the construction of fused features between modalities, neglecting the preservation of single-modal characteristics. In this work, audio signals, visual cues from lip region images, and facial electromyography signals are used for unrestricted speech recognition, which can effectively resist the noise interference affecting individual modalities. To preserve the unique feature expression of each speech modality and improve the global perception of the coupling correlations among them, a Dual Adaptive Gating fusion framework (dubbed DuAGNet) is proposed, utilizing modality-specific and feature-specific adaptive gating networks. A multimodal speech dataset is constructed from forty subjects to validate the effectiveness of the proposed DuAGNet, covering three modalities of speech data and 100 classes of Chinese phrases. Both the highest recognition accuracy of 98.79% and the lowest standard deviation of 0.83 are obtained on clean test data, and a maximum accuracy increase of over 80% is achieved compared to audio-only speech recognition systems when severe audio noise is introduced. |
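The abstract describes fusion at two levels of adaptive gating: modality-specific gates that reweight whole modalities and feature-specific gates that reweight individual dimensions of the fused feature. The record does not include the authors' implementation, so the following is a minimal PyTorch sketch of that dual-gating idea only; the module name, embedding sizes, and all design details are illustrative assumptions, not the published DuAGNet architecture.

```python
# Minimal sketch of dual adaptive gating fusion as described in the abstract.
# All names, dimensions, and layer choices are assumptions for illustration;
# this is not the authors' implementation.
import torch
import torch.nn as nn

class DualAdaptiveGatingFusion(nn.Module):
    def __init__(self, dim: int = 256, num_classes: int = 100):
        super().__init__()
        # Modality-specific gating: one weight per modality, conditioned on
        # all three inputs, so a noisy modality can be adaptively suppressed.
        self.modality_gate = nn.Sequential(nn.Linear(3 * dim, 3), nn.Softmax(dim=-1))
        # Feature-specific gating: one sigmoid weight per fused-feature dimension.
        self.feature_gate = nn.Sequential(nn.Linear(3 * dim, dim), nn.Sigmoid())
        self.proj = nn.Linear(3 * dim, dim)
        self.classifier = nn.Linear(dim, num_classes)

    def forward(self, audio, lip, emg):
        # audio, lip, emg: (batch, dim) embeddings from per-modality encoders
        concat = torch.cat([audio, lip, emg], dim=-1)          # (batch, 3*dim)
        w = self.modality_gate(concat)                          # (batch, 3)
        gated = torch.cat(
            [w[:, 0:1] * audio, w[:, 1:2] * lip, w[:, 2:3] * emg], dim=-1
        )
        fused = self.proj(gated)                                # (batch, dim)
        fused = self.feature_gate(concat) * fused               # per-dimension gate
        return self.classifier(fused)                           # (batch, num_classes)

# Usage with random stand-in embeddings:
model = DualAdaptiveGatingFusion()
a, l, e = (torch.randn(4, 256) for _ in range(3))
logits = model(a, l, e)  # torch.Size([4, 100])
```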
ArticleNumber | 224 |
Author | Wu, Jinghan; Zhang, Yakun; Zhang, Meishan; Zheng, Changyan; Zhang, Xingyu; Xie, Liang; An, Xingwei; Yin, Erwei |
ContentType | Journal Article |
Copyright | Copyright Springer Nature B.V. Feb 2025 |
DOI | 10.1007/s10489-024-06119-0 |
Discipline | Computer Science |
EISSN | 1573-7497 |
ISSN | 0924-669X |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 3 |
Language | English |
ORCID | 0000-0002-2147-9888 |
PublicationDate | 2025-02-01
PublicationPlace | Boston |
PublicationTitle | Applied intelligence (Dordrecht, Netherlands) |
PublicationYear | 2025 |
Publisher | Springer Nature B.V |
StartPage | 224 |
SubjectTerms | Audio data; Audio signals; Speech; Speech recognition; Voice recognition |
Title | DuAGNet: an unrestricted multimodal speech recognition framework using dual adaptive gating fusion |
URI | https://www.proquest.com/docview/3149314017 |
Volume | 55 |