Techniques for handling convolutional distortion with 'missing data' automatic speech recognition
Published in | Speech Communication, Vol. 43, no. 1, pp. 123-142 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | Amsterdam: Elsevier B.V., 01.06.2004 |
Summary: | In this study we describe two techniques for handling convolutional distortion with 'missing data' speech recognition using spectral features. The missing data approach to automatic speech recognition (ASR) is motivated by a model of human speech perception, and involves the modification of a hidden Markov model (HMM) classifier to deal with missing or unreliable features. Although the missing data paradigm was proposed as a means of handling additive noise in ASR, we demonstrate that it can also be effective in dealing with convolutional distortion. Firstly, we propose a normalisation technique for handling spectral distortions and changes of input level (possibly in the presence of additive noise). The technique computes a normalising factor only from the most intense regions of the speech spectrum, which are likely to remain intact across various noise conditions. We show that the proposed normalisation method improves performance compared to a conventional missing data approach with spectrally distorted and noise contaminated speech, and in conditions where the gain of the input signal varies. Secondly, we propose a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy. This is achieved by using modulation filtering to identify 'reliable' regions of the speech spectrum. We demonstrate that our approach improves recognition performance in cases where the reverberation time T60 exceeds 0.7 s, compared to a baseline system which uses acoustic features derived from perceptual linear prediction and the modulation-filtered spectrogram. |
---|---|
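The first technique, as the summary describes it, estimates a normalising factor only from the most intense regions of the speech spectrum, on the grounds that these regions are likely to survive additive noise. A minimal sketch of that idea follows; the fraction of bins treated as "intense" and the use of their mean as the gain are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def peak_region_norm(spec, frac=0.1):
    """Normalise a (frames x channels) magnitude spectrogram by a gain
    estimated only from its most intense bins, which are assumed to
    remain intact under additive noise. `frac` (the fraction of bins
    treated as 'intense') is an illustrative choice, not from the paper."""
    flat = np.sort(spec.ravel())
    k = max(1, int(frac * flat.size))
    gain = flat[-k:].mean()          # normalising factor from peak regions only
    return spec / gain

# Because the gain is estimated from the data itself, the output is
# invariant to the overall input level, which is the property the
# summary claims for varying input gain:
x = np.abs(np.random.default_rng(0).normal(size=(50, 32)))
print(np.allclose(peak_region_norm(x), peak_region_norm(3.0 * x)))  # True
```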
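The second technique uses modulation filtering to flag time-frequency regions that are not badly contaminated by reverberation. A sketch under assumed parameters: the passband of roughly 1-16 Hz corresponds to typical speech modulation rates, and the energy-ratio threshold is purely illustrative; neither value is taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def reliable_mask(spec, frame_rate=100.0, lo=1.0, hi=16.0, thresh=0.5):
    """Flag time-frequency bins whose temporal envelope survives a
    band-pass modulation filter, i.e. regions dominated by speech-rate
    modulations rather than reverberant smearing. The passband and
    threshold here are illustrative assumptions.

    spec: (frames x channels) non-negative spectrogram, sampled at
    `frame_rate` frames per second along axis 0."""
    b, a = butter(2, [lo, hi], btype="band", fs=frame_rate)
    filtered = filtfilt(b, a, spec, axis=0)   # filter each channel's envelope
    ratio = np.abs(filtered) / (spec + 1e-9)  # energy retained after filtering
    return ratio > thresh                     # True = 'reliable' bin
```

In a missing-data recogniser, a mask like this would supply the reliable/unreliable partition consumed by the modified HMM classifier.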
ISSN: | 0167-6393, 1872-7182 |
DOI: | 10.1016/j.specom.2004.02.005 |