Mutual Information Based Dynamic Integration of Multiple Feature Streams for Robust Real-Time LVCSR

Bibliographic Details
Published in IEICE Transactions on Information and Systems, Vol. E91.D, No. 3, pp. 815-824
Main Authors SATO, Shoei, KOBAYASHI, Akio, ONOE, Kazuo, HOMMA, Shinichi, IMAI, Toru, TAKAGI, Tohru, KOBAYASHI, Tetsunori
Format Journal Article
Language English
Published Oxford: The Institute of Electronics, Information and Communication Engineers, 2008
Oxford University Press
Abstract We present a novel method of integrating the likelihoods of multiple feature streams, representing different acoustic aspects, for robust speech recognition. The integration algorithm dynamically calculates a frame-wise stream weight so that a higher weight is given to a stream that is robust to a variety of noisy environments or speaking styles. Such a robust stream is expected to show discriminative ability. A conventional method proposed for the recognition of spoken digits calculates the weights from the entropy of the whole set of HMM states. This paper extends the dynamic weighting to a real-time large-vocabulary continuous speech recognition (LVCSR) system. The proposed weight is calculated in real time from the mutual information between an input stream and the active HMM states in the search space, without an additional likelihood calculation. Furthermore, the mutual information takes the width of the search space into account by calculating the marginal entropy from the number of active states. In this paper, we integrate three features that are extracted through auditory filters by taking into account the human auditory system's ability to extract amplitude and frequency modulations. Accordingly, features representing energy, amplitude drift, and resonant-frequency drift are integrated. These features are expected to provide complementary clues for speech recognition. Speech recognition experiments on field reports and spontaneous commentary from Japanese broadcast news showed that the proposed method reduced word errors by 9.2% in field reports and 4.7% in spontaneous commentaries relative to the best result obtained from a single stream.
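The frame-wise weighting idea described in the abstract can be sketched in code. The following Python fragment is a minimal, hypothetical illustration and not the authors' implementation: it approximates the marginal entropy by the log of the number of active HMM states and scores each stream by how far its state posteriors fall below that bound (a mutual-information-style score), then normalizes the scores into stream weights. The function name stream_weights, the softmax over log-likelihoods, and the final weight normalization are assumptions made for this example.

```python
import numpy as np

def stream_weights(stream_log_likelihoods):
    """Illustrative frame-wise stream weighting (a sketch, not the paper's exact formulation).

    stream_log_likelihoods: list of 1-D arrays, one per feature stream, holding the
    log-likelihoods of the currently active HMM states in the search space.
    Returns one weight per stream, normalized to sum to 1.
    """
    scores = []
    for ll in stream_log_likelihoods:
        n_active = len(ll)                       # width of the search space
        post = np.exp(ll - np.max(ll))
        post /= post.sum()                       # state posteriors for this stream
        cond_entropy = -np.sum(post * np.log(post + 1e-12))
        marginal_entropy = np.log(n_active)      # uniform bound over the active states
        scores.append(max(marginal_entropy - cond_entropy, 0.0))
    w = np.asarray(scores)
    return w / w.sum() if w.sum() > 0 else np.full(len(w), 1.0 / len(w))

# Example: a peaked (discriminative) stream receives a larger weight than a flat one.
peaked = np.log(np.array([0.90, 0.05, 0.05]))
flat   = np.log(np.array([0.34, 0.33, 0.33]))
print(stream_weights([peaked, flat]))   # first weight > second
```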
Author HOMMA, Shinichi
IMAI, Toru
TAKAGI, Tohru
ONOE, Kazuo
KOBAYASHI, Akio
SATO, Shoei
KOBAYASHI, Tetsunori
Author_xml – sequence: 1
  fullname: SATO, Shoei
  organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories
– sequence: 2
  fullname: KOBAYASHI, Akio
  organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories
– sequence: 3
  fullname: ONOE, Kazuo
  organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories
– sequence: 4
  fullname: HOMMA, Shinichi
  organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories
– sequence: 5
  fullname: IMAI, Toru
  organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories
– sequence: 6
  fullname: TAKAGI, Tohru
  organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories
– sequence: 7
  fullname: KOBAYASHI, Tetsunori
  organization: Waseda University
BackLink http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=20214950 (View record in Pascal Francis)
ContentType Journal Article
Copyright 2008 The Institute of Electronics, Information and Communication Engineers
2008 INIST-CNRS
Copyright_xml – notice: 2008 The Institute of Electronics, Information and Communication Engineers
– notice: 2008 INIST-CNRS
DOI 10.1093/ietisy/e91-d.3.815
DatabaseName Pascal-Francis
CrossRef
Computer and Information Systems Abstracts
Technology Research Database
ProQuest Computer Science Collection
Advanced Technologies Database with Aerospace
Computer and Information Systems Abstracts – Academic
Computer and Information Systems Abstracts Professional
DatabaseTitle CrossRef
Computer and Information Systems Abstracts
Technology Research Database
Computer and Information Systems Abstracts – Academic
Advanced Technologies Database with Aerospace
ProQuest Computer Science Collection
Computer and Information Systems Abstracts Professional
DatabaseTitleList Computer and Information Systems Abstracts

DeliveryMethod fulltext_linktorsrc
Discipline Engineering
Computer Science
Applied Sciences
EISSN 1745-1361
EndPage 824
ExternalDocumentID 10_1093_ietisy_e91_d_3_815
20214950
article_transinf_E91_D_3_E91_D_3_815_article_char_en
ISSN 0916-8532
1745-1361
IngestDate Sat Aug 17 02:30:23 EDT 2024
Fri Aug 23 02:38:50 EDT 2024
Sun Oct 22 16:08:35 EDT 2023
Wed Apr 05 04:59:50 EDT 2023
IsDoiOpenAccess true
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
Issue 3
Keywords Discriminant analysis
Information integration
Frequency modulation
Probabilistic approach
active hypotheses
Amplitude modulation
Entropy
Cable television
Background noise
Algorithm
Japanese
stream integration
Weighting
Audiovisual document
Resonance frequency
Hidden Markov models
News
Speech recognition
Frequency drift
Signal processing
Feature extraction
Speech processing
Mutual information
Language English
License CC BY 4.0
LinkModel OpenURL
Notes ObjectType-Article-1
SourceType-Scholarly Journals-1
ObjectType-Feature-2
content type line 23
OpenAccessLink https://www.jstage.jst.go.jp/article/transinf/E91.D/3/E91.D_3_815/_article/-char/en
PQID 1671226222
PQPubID 23500
PageCount 10
ParticipantIDs proquest_miscellaneous_1671226222
crossref_primary_10_1093_ietisy_e91_d_3_815
pascalfrancis_primary_20214950
jstage_primary_article_transinf_E91_D_3_E91_D_3_815_article_char_en
PublicationCentury 2000
PublicationDate 2008-00-00
PublicationDateYYYYMMDD 2008-01-01
PublicationDate_xml – year: 2008
  text: 2008-00-00
PublicationDecade 2000
PublicationPlace Oxford
PublicationPlace_xml – name: Oxford
PublicationTitle IEICE Transactions on Information and Systems
PublicationTitleAlternate IEICE Trans. Inf. & Syst.
PublicationYear 2008
Publisher The Institute of Electronics, Information and Communication Engineers
Oxford University Press
Publisher_xml – name: The Institute of Electronics, Information and Communication Engineers
– name: Oxford University Press
SSID ssj0018215
Score 1.7861643
SourceID proquest
crossref
pascalfrancis
jstage
SourceType Aggregation Database
Index Database
Publisher
StartPage 815
SubjectTerms active hypotheses
Applied sciences
Artificial intelligence
Computer science; control theory; systems
Dynamical systems
Dynamics
Entropy
Exact sciences and technology
Information, signal and communications theory
Mathematical analysis
Modulation, demodulation
mutual information
Real time
Searching
Signal and communications theory
Signal processing
Speech and sound recognition and synthesis. Linguistics
Speech processing
Speech recognition
stream integration
Streams
Systems, networks and services of telecommunications
Telecommunications
Telecommunications and information theory
Transmission and modulation (techniques and equipments)
Title Mutual Information Based Dynamic Integration of Multiple Feature Streams for Robust Real-Time LVCSR
URI https://www.jstage.jst.go.jp/article/transinf/E91.D/3/E91.D_3_815/_article/-char/en
https://search.proquest.com/docview/1671226222
Volume E91.D
linkProvider Colorado Alliance of Research Libraries