Robust speech recognition by integrating speech separation and hypothesis testing

Bibliographic Details
Published in	Speech Communication, Vol. 52, no. 1, pp. 72-81
Main Authors	Srinivasan, Soundararajan; Wang, DeLiang
Format	Journal Article
Language	English
Published	Amsterdam: Elsevier B.V., 2010
Subjects	Accuracy; Acoustic signal; Applied sciences; Cues; Exact sciences and technology; F region; Frequency domain method; Hypothesis test; Ideal binary mask; Information, signal and communications theory; Labelling; Lattices; Masks; Mathematical models; Missing data; Missing-data recognizer; Multistage method; Performance evaluation; Recognition; Robust speech recognition; Separation; Signal processing; Speech; Speech processing; Speech recognition; Speech segregation; Telecommunications and information theory; Top-down processing; Vocal signal
Summary:	Missing-data methods attempt to improve robust speech recognition by distinguishing between reliable and unreliable data in the time–frequency (T–F) domain. Such methods require a binary mask to label speech-dominant T–F regions of a noisy speech signal as reliable and the rest as unreliable. Current methods for computing the mask are based mainly on bottom-up cues such as harmonicity and produce labeling errors that degrade recognition performance. In this paper, we propose a two-stage recognition system that combines bottom-up and top-down cues in order to simultaneously improve both mask estimation and recognition accuracy. First, an n-best lattice consistent with a speech separation mask is generated. The lattice is then re-scored by expanding the mask using a model-based hypothesis test to determine the reliability of individual T–F units. Systematic evaluations of the proposed system show significant improvement in recognition performance compared to that using speech separation alone.
ISSN:	0167-6393, 1872-7182
DOI:	10.1016/j.specom.2009.08.008
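
The summary describes a binary mask that labels speech-dominant T–F units as reliable and the rest as unreliable. The sketch below illustrates that labeling idea only, in the form of an ideal binary mask computed from known speech and noise energies. It is not the authors' two-stage system: the function name `ideal_binary_mask`, the 0 dB default threshold, and the toy energy values are all assumptions made for illustration. In practice the speech and noise energies are not separately available, which is why the mask has to be estimated.

```python
import math


def ideal_binary_mask(speech_energy, noise_energy, threshold_db=0.0):
    """Return a 0/1 mask over a time-frequency grid.

    A unit is marked 1 (reliable) when its local speech-to-noise ratio
    exceeds threshold_db, and 0 (unreliable) otherwise.
    Both inputs are lists of rows with non-negative energy values.
    The 0 dB default is an assumption for this illustration.
    """
    mask = []
    for speech_row, noise_row in zip(speech_energy, noise_energy):
        row = []
        for s, n in zip(speech_row, noise_row):
            if s <= 0.0:
                reliable = False      # no speech energy: never reliable
            elif n <= 0.0:
                reliable = True       # speech present, no noise at all
            else:
                local_snr_db = 10.0 * math.log10(s / n)
                reliable = local_snr_db > threshold_db
            row.append(1 if reliable else 0)
        mask.append(row)
    return mask


# Toy 2 x 3 grid of energies (hypothetical values).
speech = [[4.0, 0.5, 2.0], [0.1, 3.0, 1.0]]
noise = [[1.0, 1.0, 2.0], [1.0, 0.5, 0.0]]
mask = ideal_binary_mask(speech, noise)
print(mask)
```

A missing-data recognizer would then treat the units marked 1 as reliable evidence and handle the units marked 0 as missing or uncertain.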