Mutual Information Based Dynamic Integration of Multiple Feature Streams for Robust Real-Time LVCSR
We present a novel method of integrating the likelihoods of multiple feature streams, representing different acoustic aspects, for robust speech recognition. The integration algorithm dynamically calculates a frame-wise stream weight so that a higher weight is given to a stream that is robust to a v...
Published in | IEICE Transactions on Information and Systems Vol. E91.D; no. 3; pp. 815 - 824 |
---|---|
Main Authors | Sato, Shoei, Kobayashi, Akio, Onoe, Kazuo, Homma, Shinichi, Imai, Toru, Takagi, Tohru, Kobayashi, Tetsunori |
Format | Journal Article |
Language | English |
Published | Oxford: The Institute of Electronics, Information and Communication Engineers; Oxford University Press, 2008 |
Subjects | |
Online Access | Get full text |
Abstract | We present a novel method of integrating the likelihoods of multiple feature streams, representing different acoustic aspects, for robust speech recognition. The integration algorithm dynamically calculates a frame-wise stream weight so that a higher weight is given to a stream that is robust to a variety of noisy environments or speaking styles. Such a robust stream is expected to show discriminative ability. A conventional method proposed for the recognition of spoken digits calculates the weights from the entropy of the whole set of HMM states. This paper extends the dynamic weighting to a real-time large-vocabulary continuous speech recognition (LVCSR) system. The proposed weight is calculated in real time from the mutual information between an input stream and the active HMM states in a search space, without an additional likelihood calculation. Furthermore, the mutual information takes the width of the search space into account by calculating the marginal entropy from the number of active states. In this paper, we integrate three features that are extracted through auditory filters by taking into account the human auditory system's ability to extract amplitude and frequency modulations. Accordingly, features representing energy, amplitude drift, and resonant frequency drift are integrated. These features are expected to provide complementary clues for speech recognition. Speech recognition experiments on field reports and spontaneous commentaries from Japanese broadcast news showed that the proposed method reduced word errors by 9.2% in field reports and 4.7% in spontaneous commentaries relative to the best result obtained from a single stream. |
---|---|
AbstractList | We present a novel method of integrating the likelihoods of multiple feature streams, representing different acoustic aspects, for robust speech recognition. The integration algorithm dynamically calculates a frame-wise stream weight so that a higher weight is given to a stream that is robust to a variety of noisy environments or speaking styles. Such a robust stream is expected to show discriminative ability. A conventional method proposed for the recognition of spoken digits calculates the weights from the entropy of the whole set of HMM states. This paper extends the dynamic weighting to a real-time large-vocabulary continuous speech recognition (LVCSR) system. The proposed weight is calculated in real time from the mutual information between an input stream and the active HMM states in a search space, without an additional likelihood calculation. Furthermore, the mutual information takes the width of the search space into account by calculating the marginal entropy from the number of active states. In this paper, we integrate three features that are extracted through auditory filters by taking into account the human auditory system's ability to extract amplitude and frequency modulations. Accordingly, features representing energy, amplitude drift, and resonant frequency drift are integrated. These features are expected to provide complementary clues for speech recognition. Speech recognition experiments on field reports and spontaneous commentaries from Japanese broadcast news showed that the proposed method reduced word errors by 9.2% in field reports and 4.7% in spontaneous commentaries relative to the best result obtained from a single stream. |
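The entropy-based dynamic weighting the abstract describes can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes per-frame likelihoods over the active HMM states are available for each stream, and all names (`stream_weights`, `stream_likelihoods`) are hypothetical. The score for each stream is the gap between the marginal entropy (log of the number of active states, capturing the width of the search space) and the entropy of that stream's state posteriors, so a peaked, discriminative stream receives a higher weight.

```python
import math

def stream_weights(stream_likelihoods):
    """Frame-wise stream weights from an entropy-based mutual-information score.

    stream_likelihoods: one dict per stream, mapping active-HMM-state ids to
    that stream's likelihoods for the current frame (hypothetical interface).
    """
    scores = []
    for likes in stream_likelihoods:
        total = sum(likes.values())
        probs = [v / total for v in likes.values()]
        # Entropy of this stream's normalized state posteriors.
        entropy = -sum(p * math.log(p) for p in probs if p > 0)
        # Marginal entropy: uniform over the currently active states, so the
        # score reflects how wide the search space is at this frame.
        marginal = math.log(len(likes))
        # Low posterior entropy => peaked, discriminative stream => high score.
        scores.append(max(marginal - entropy, 0.0))
    total_score = sum(scores)
    if total_score == 0.0:
        # No stream is informative at this frame; fall back to equal weights.
        return [1.0 / len(scores)] * len(scores)
    return [s / total_score for s in scores]

# A peaked (discriminative) stream should outweigh a flat (uninformative) one.
peaked = {0: 0.97, 1: 0.01, 2: 0.01, 3: 0.01}
flat = {0: 0.25, 1: 0.25, 2: 0.25, 3: 0.25}
w = stream_weights([peaked, flat])
```

In a real decoder the weights would multiply the per-stream log-likelihoods during the frame-synchronous search; here the normalization to a sum of one is just one simple choice.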
Author | HOMMA, Shinichi IMAI, Toru TAKAGI, Tohru ONOE, Kazuo KOBAYASHI, Akio SATO, Shoei KOBAYASHI, Tetsunori |
Author_xml | – sequence: 1 fullname: SATO, Shoei organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories – sequence: 2 fullname: KOBAYASHI, Akio organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories – sequence: 3 fullname: ONOE, Kazuo organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories – sequence: 4 fullname: HOMMA, Shinichi organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories – sequence: 5 fullname: IMAI, Toru organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories – sequence: 6 fullname: TAKAGI, Tohru organization: NHK (Japan Broadcasting Corporation) Science and Technical Research Laboratories – sequence: 7 fullname: KOBAYASHI, Tetsunori organization: Waseda University |
BackLink | http://pascal-francis.inist.fr/vibad/index.php?action=getRecordDetail&idt=20214950$$DView record in Pascal Francis |
BookMark | eNo9kM1u2zAQhIkiAeokfYGeeCnQixwuacrSsXV-msJBASftlVhTq5QBRbkkdfDbl4VSn_Yw38wO5oKdhTEQYx9BLEG06tpRdul4TS1U3VItG9Dv2ALWK12BquGMLUQLddVoJd-zi5RehYBGgl4w-zjlCT1_CP0YB8xuDPwrJur4zTHg4GxRMr3EWRl7_jj57A6e-B1hniLxpxwJh8SLn-_G_ZQy3xH66tkNxLe_Nk-7K3beo0_04e1esp93t8-bb9X2x_3D5su2slq1ubKl9b5U7ISwte6arkZpAVq0Vipcaau7niTUsl9j00O93zcAGteSVGtbtVKX7POce4jjn4lSNoNLlrzHQOOUDNRrkLKWUhZUzqiNY0qRenOIbsB4NCDMv0XNvKgpi5rOKFO6FdOnt3xMFn0fMViXTk4pJKxaLQr3feZeU8YXOgEYs7OeTC7G5EJvbkv4TQn_f8uTE2R_YzQU1F_-apXN |
ContentType | Journal Article |
Copyright | 2008 The Institute of Electronics, Information and Communication Engineers 2008 INIST-CNRS |
Copyright_xml | – notice: 2008 The Institute of Electronics, Information and Communication Engineers – notice: 2008 INIST-CNRS |
DBID | IQODW AAYXX CITATION 7SC 8FD JQ2 L7M L~C L~D |
DOI | 10.1093/ietisy/e91-d.3.815 |
DatabaseName | Pascal-Francis CrossRef Computer and Information Systems Abstracts Technology Research Database ProQuest Computer Science Collection Advanced Technologies Database with Aerospace Computer and Information Systems Abstracts Academic Computer and Information Systems Abstracts Professional |
DatabaseTitle | CrossRef Computer and Information Systems Abstracts Technology Research Database Computer and Information Systems Abstracts – Academic Advanced Technologies Database with Aerospace ProQuest Computer Science Collection Computer and Information Systems Abstracts Professional |
DatabaseTitleList | Computer and Information Systems Abstracts |
DeliveryMethod | fulltext_linktorsrc |
Discipline | Engineering Computer Science Applied Sciences |
EISSN | 1745-1361 |
EndPage | 824 |
ExternalDocumentID | 10_1093_ietisy_e91_d_3_815 20214950 article_transinf_E91_D_3_E91_D_3_815_article_char_en |
GroupedDBID | -~X 1TH 5GY ABQTQ ABZEH ACGFS ADNWM AENEX AFFNX ALMA_UNASSIGNED_HOLDINGS CKLRP CS3 DU5 EBS EJD F5P ICE JSF JSH KQ8 OK1 P2P RJT RYL RZJ TN5 TQK ZKX ABTAH C1A H13 IQODW RIG VOH ZE2 ZY4 AAYXX CITATION 7SC 8FD JQ2 L7M L~C L~D |
ID | FETCH-LOGICAL-c539t-c815b853d00c65d8d6a2c119acc23a45c5dfe2162f7a8f16bb8115a72e39c9343 |
ISSN | 0916-8532 1745-1361 |
IngestDate | Sat Aug 17 02:30:23 EDT 2024 Fri Aug 23 02:38:50 EDT 2024 Sun Oct 22 16:08:35 EDT 2023 Wed Apr 05 04:59:50 EDT 2023 |
IsDoiOpenAccess | true |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 3 |
Keywords | Discriminant analysis Information integration Frequency modulation Probabilistic approach active hypotheses Amplitude modulation Entropy Cable television Background noise Algorithm Japanese stream integration Weighting Audiovisual document Resonance frequency Hidden Markov models News Speech recognition Frequency drift Signal processing Feature extraction Speech processing Mutual information |
Language | English |
License | CC BY 4.0 |
LinkModel | OpenURL |
MergedId | FETCHMERGED-LOGICAL-c539t-c815b853d00c65d8d6a2c119acc23a45c5dfe2162f7a8f16bb8115a72e39c9343 |
Notes | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23 |
OpenAccessLink | https://www.jstage.jst.go.jp/article/transinf/E91.D/3/E91.D_3_815/_article/-char/en |
PQID | 1671226222 |
PQPubID | 23500 |
PageCount | 10 |
ParticipantIDs | proquest_miscellaneous_1671226222 crossref_primary_10_1093_ietisy_e91_d_3_815 pascalfrancis_primary_20214950 jstage_primary_article_transinf_E91_D_3_E91_D_3_815_article_char_en |
PublicationCentury | 2000 |
PublicationDate | 2008-00-00 |
PublicationDateYYYYMMDD | 2008-01-01 |
PublicationDate_xml | – year: 2008 text: 2008-00-00 |
PublicationDecade | 2000 |
PublicationPlace | Oxford |
PublicationPlace_xml | – name: Oxford |
PublicationTitle | IEICE Transactions on Information and Systems |
PublicationTitleAlternate | IEICE Trans. Inf. & Syst. |
PublicationYear | 2008 |
Publisher | The Institute of Electronics, Information and Communication Engineers Oxford University Press |
Publisher_xml | – name: The Institute of Electronics, Information and Communication Engineers – name: Oxford University Press |
References | [1] T. Imai, A. Kobayashi, S. Sato, S. Homma, K. Onoe, and T. Kobayakawa, “Speech recognition for subtitling Japanese live broadcasts,” 18th International Congress on Acoustics (ICA), pp. I-165-168, 2004.
[2] O. Scharenborg, “Reaching over the gap: A review of efforts to link human and automatic speech recognition research,” Speech Commun., vol. 49, pp. 336-347, 2007.
[3] R. Haeb-Umbach and H. Ney, “Linear discriminant analysis for improved large vocabulary continuous speech recognition,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. I-13-16, 1992.
[4] N. Kumar and A. G. Andreou, “Heteroscedastic discriminant analysis and reduced rank HMMs for improved speech recognition,” Speech Commun., vol. 26, pp. 283-297, 1998.
[5] R. A. Gopinath, “Maximum likelihood modeling with Gaussian distributions for classification,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. II-661-664, 1998.
[6] M. J. F. Gales, “Semi-tied covariance matrices for hidden Markov models,” IEEE Trans. Speech Audio Process., vol. 7, pp. 272-281, 1999.
[7] J. Hernando, “Maximum likelihood weighting of dynamic speech features for CDHMM speech recognition,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 1267-1270, 1997.
[8] Y. L. Chow, “Maximum mutual information estimation of HMM parameters for continuous speech recognition using the N-best algorithm,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 701-704, 1990.
[9] G. Potamianos and H. P. Graf, “Discriminative training of HMM stream exponents for audio-visual speech recognition,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. 6-3733-3736, 1998.
[10] G. Gravier, S. Axelrod, G. Potamianos, and C. Neti, “Maximum entropy and MCE based HMM stream weight estimation for audiovisual ASR,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. I-853-856, 2002.
[11] H. Misra, H. Bourlard, and V. Tyagi, “New entropy based combination rules in HMM/ANN multi-stream ASR,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. II-741-744, 2003.
[12] L. Lee, Y. Chen, and C. Wan, “Entropy-based feature parameter weighting for robust speech recognition,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. I-41-44, 2006.
[13] H. C. Hart, A. R. Palmer, and D. A. Hall, “Amplitude and frequency-modulated stimuli activate common regions of human auditory cortex,” Cerebral Cortex, vol. 13, pp. 773-781, 2003.
[14] D. Dimitriadis, P. Maragos, and A. Potamianos, “Auditory Teager energy cepstrum coefficients for robust speech recognition,” Interspeech, pp. 3013-3016, 2005.
[15] P. Maragos, J. F. Kaiser, and T. F. Quatieri, “Energy separation in signal modulations with application to speech analysis,” IEEE Trans. Signal Process., vol. 41, no. 10, pp. 3024-3051, Oct. 1993.
[16] Y. Wang, S. Greenberg, J. Swaminathan, R. Kumaresan, and D. Poeppel, “Comprehensive modulation representation for automatic speech recognition,” Interspeech, pp. 3025-3028, 2005.
[17] D. Dimitriadis, P. Maragos, and A. Potamianos, “Robust AM-FM features for speech recognition,” IEEE Signal Process. Lett., vol. 12, no. 9, pp. 621-624, Sept. 2005.
[18] K. Nie, G. Stickney, and F. G. Zeng, “Encoding frequency modulation to improve cochlear implant performance in noise,” IEEE Trans. Biomed. Eng., vol. 52, no. 1, pp. 64-73, Jan. 2005.
[19] A. Garg, G. Potamianos, C. Neti, and T. S. Huang, “Frame-dependent multi-stream reliability indicators for audio-visual speech recognition,” Proc. IEEE Int. Conf. Acoust. Speech Signal Process., pp. I-24-27, 2003.
[20] G. Potamianos and C. Neti, “Stream confidence estimation for audio-visual speech recognition,” Proc. Int. Conf. Spoken Language Processing, pp. III-746-749, 2000.
[21] A. Adjoudani and C. Benoit, “On the integration of audio and visual parameters in an HMM-based ASR,” in Speechreading by Humans and Machines, eds. D. G. Stork and M. E. Hennecke, pp. 461-471, Springer, Berlin, 1996.
[22] R. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, and M. Allerhand, “Complex sounds and auditory images,” Auditory Physiology and Perception, Proc. 9th International Symposium on Hearing, pp. 429-446, 1992.
[23] M. Slaney, “An efficient implementation of the Patterson-Holdsworth auditory filter bank,” Apple Computer Technical Report #35, 1993.
[24] A. Ando, T. Imai, A. Kobayashi, H. Isono, and K. Nakabayashi, “Real-time transcription system for simultaneous subtitling of Japanese broadcast news programs,” IEEE Trans. Broadcast., vol. 46, no. 3, pp. 189-196, 2000.
[25] D. E. Ferguson, “Fibonaccian searching,” Commun. ACM, vol. 3, no. 12, p. 648, 1960.
[26] A. C. Morris, A. Hagen, H. Glothin, and H. Bourlard, “Multi-stream adaptive evidence combination for robust ASR,” Speech Commun., vol. 34, pp. 25-40, 2001.
[27] C. P. Cox, A Handbook of Introductory Statistical Methods, pp. 46-50, John Wiley & Sons, 1987. |
References_xml | |
SSID | ssj0018215 |
Score | 1.7861643 |
Snippet | We present a novel method of integrating the likelihoods of multiple feature streams, representing different acoustic aspects, for robust speech recognition.... |
SourceID | proquest crossref pascalfrancis jstage |
SourceType | Aggregation Database Index Database Publisher |
StartPage | 815 |
SubjectTerms | active hypotheses Applied sciences Artificial intelligence Computer science; control theory; systems Dynamical systems Dynamics Entropy Exact sciences and technology Information, signal and communications theory Mathematical analysis Modulation, demodulation mutual information Real time Searching Signal and communications theory Signal processing Speech and sound recognition and synthesis. Linguistics Speech processing Speech recognition stream integration Streams Systems, networks and services of telecommunications Telecommunications Telecommunications and information theory Transmission and modulation (techniques and equipments) |
Title | Mutual Information Based Dynamic Integration of Multiple Feature Streams for Robust Real-Time LVCSR |
URI | https://www.jstage.jst.go.jp/article/transinf/E91.D/3/E91.D_3_815/_article/-char/en https://search.proquest.com/docview/1671226222 |
Volume | E91.D |
hasFullText | 1 |
inHoldings | 1 |
isFullTextHit | |
isPrint | |
ispartofPNX | IEICE Transactions on Information and Systems, 2008/03/01, Vol.E91.D(3), pp.815-824 |
link | http://utb.summon.serialssolutions.com/2.0.0/link/0/eLvHCXMwnV1bb9MwFLZg8ABCXAaIcpmMxFuVrrFz8yN0nTrWDsFa1DfLdhzWjSXT2j5sv57jS7JU42HwklbWcezkfDk-Pj4XhD7JlAnCBAtgac-CiPVFwJgMg0IyUN8j2Y-oiUaeHCWjWfR1Hs9vXHltdMlK9tT1X-NK_oer0AZ8NVGy_8DZ5qbQAP-Bv3AFDsP1TjyerG30hw8pspz8AqtS3t1zdeatve_XZaMVTmrvQaP4mZMDcybtUzIYH-v10vgoi9-BCQzpjn8Ojn-0ldeD4cFgaIpK1BXG7VHDojW6tcK3cqAb042whZq6xyeVXjTiHaTIlSnjZEXTmfMFs7be0rkmHorrddM4qs7PhbuHCeM8WWyYKtpyNY3iIKQu77qHFG3JzczFdNZLsAurviXdXearhV4tlleG4SwM8h7tNZ3bybSPvvH92XjMp8P59D56QEAOGY_Pw-83h0wZsQUumrn5mCoYZdeNsdsaYUNveXgKqrvJyfDkQizhWypcEZRb67lVUqbP0VO_u8CfHVReoHu63EbP6sod2AvybfS4lYbyJVIOR7iFI2xxhD2OcAtHuCpwjSPscYQ9jjD0xw5HuMERtjh6hWb7w-lgFPjiG4GKKVsFCp5Zgi6X9_sqifMsTwRRYciEUoSKKFZxXmgSJqRIRVaEiZQZbC5ESjRlitGIvkZbZVXqNwjrglAa65QoEkcij4Aw07ARSKXWRCZxB3XrV8svXI4V7nwjKHeM4MAInnPKYVIdNHBvv6H13x-3-AfA8CFQ7wF1_Qu9GiITzAiyo4N2NljX3IyYJIIs7nfQx5qXHCSuOUYTpa7WSw4TD0GygWL99g4079Aj52JkrHbv0dbqcq0_gB67kjsWjX8AIAWkeA |
link.rule.ids | 315,783,787,4031,27935,27936,27937 |
linkProvider | Colorado Alliance of Research Libraries |
openUrl | ctx_ver=Z39.88-2004&ctx_enc=info%3Aofi%2Fenc%3AUTF-8&rfr_id=info%3Asid%2Fsummon.serialssolutions.com&rft_val_fmt=info%3Aofi%2Ffmt%3Akev%3Amtx%3Ajournal&rft.genre=article&rft.atitle=Mutual+Information+Based+Dynamic+Integration+of+Multiple+Feature+Streams+for+Robust+Real-Time+LVCSR&rft.jtitle=IEICE+transactions+on+information+and+systems&rft.au=Sato%2C+Shoei&rft.au=Kobayashi%2C+Akio&rft.au=Onoe%2C+Kazuo&rft.au=Homma%2C+Shinichi&rft.date=2008&rft.issn=1745-1361&rft.issue=3&rft.spage=815&rft.epage=824&rft_id=info:doi/10.1093%2Fietisy%2Fe91-d.3.815&rft.externalDBID=NO_FULL_TEXT |
thumbnail_l | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/lc.gif&issn=0916-8532&client=summon |
thumbnail_m | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/mc.gif&issn=0916-8532&client=summon |
thumbnail_s | http://covers-cdn.summon.serialssolutions.com/index.aspx?isbn=/sc.gif&issn=0916-8532&client=summon |