Techniques for handling convolutional distortion with 'missing data' automatic speech recognition
Published in | Speech Communication, Vol. 43, no. 1, pp. 123-142 |
---|---|
Main Authors | , , |
Format | Journal Article |
Language | English |
Published | Amsterdam: Elsevier B.V., 01.06.2004 |
Summary: | In this study we describe two techniques for handling convolutional distortion with 'missing data' speech recognition using spectral features. The missing data approach to automatic speech recognition (ASR) is motivated by a model of human speech perception, and involves the modification of a hidden Markov model (HMM) classifier to deal with missing or unreliable features. Although the missing data paradigm was proposed as a means of handling additive noise in ASR, we demonstrate that it can also be effective in dealing with convolutional distortion. Firstly, we propose a normalisation technique for handling spectral distortions and changes of input level (possibly in the presence of additive noise). The technique computes a normalising factor only from the most intense regions of the speech spectrum, which are likely to remain intact across various noise conditions. We show that the proposed normalisation method improves performance compared to a conventional missing data approach with spectrally distorted and noise contaminated speech, and in conditions where the gain of the input signal varies. Secondly, we propose a method for handling reverberated speech which attempts to identify time-frequency regions that are not badly contaminated by reverberation and have strong speech energy. This is achieved by using modulation filtering to identify 'reliable' regions of the speech spectrum. We demonstrate that our approach improves recognition performance in cases where the reverberation time T60 exceeds 0.7 s, compared to a baseline system which uses acoustic features derived from perceptual linear prediction and the modulation-filtered spectrogram. |
---|---|
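The first technique, as the summary describes it, estimates a normalising factor only from the most intense regions of the speech spectrum, on the grounds that these regions are likely to survive additive noise. A minimal sketch of that idea follows; the fraction of bins treated as "intense" and the use of their mean as the gain are illustrative assumptions, not details taken from the paper.

```python
import numpy as np

def peak_region_norm(spec, frac=0.1):
    """Normalise a (frames x channels) magnitude spectrogram by a gain
    estimated only from its most intense bins, which are assumed to
    remain intact under additive noise. `frac` (the fraction of bins
    treated as 'intense') is an illustrative choice, not from the paper."""
    flat = np.sort(spec.ravel())
    k = max(1, int(frac * flat.size))
    gain = flat[-k:].mean()          # normalising factor from peak regions only
    return spec / gain

# Because the gain is estimated from the data itself, the output is
# invariant to the overall input level, which is the property the
# summary claims for varying input gain:
x = np.abs(np.random.default_rng(0).normal(size=(50, 32)))
print(np.allclose(peak_region_norm(x), peak_region_norm(3.0 * x)))  # True
```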
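The second technique uses modulation filtering to flag time-frequency regions that are not badly contaminated by reverberation. A sketch under assumed parameters: the passband of roughly 1-16 Hz corresponds to typical speech modulation rates, and the energy-ratio threshold is purely illustrative; neither value is taken from the paper.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def reliable_mask(spec, frame_rate=100.0, lo=1.0, hi=16.0, thresh=0.5):
    """Flag time-frequency bins whose temporal envelope survives a
    band-pass modulation filter, i.e. regions dominated by speech-rate
    modulations rather than reverberant smearing. The passband and
    threshold here are illustrative assumptions.

    spec: (frames x channels) non-negative spectrogram, sampled at
    `frame_rate` frames per second along axis 0."""
    b, a = butter(2, [lo, hi], btype="band", fs=frame_rate)
    filtered = filtfilt(b, a, spec, axis=0)   # filter each channel's envelope
    ratio = np.abs(filtered) / (spec + 1e-9)  # energy retained after filtering
    return ratio > thresh                     # True = 'reliable' bin
```

In a missing-data recogniser, a mask like this would supply the reliable/unreliable partition consumed by the modified HMM classifier.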
ISSN: | 0167-6393, 1872-7182 |
DOI: | 10.1016/j.specom.2004.02.005 |