Predicting speech intelligibility with deep neural networks
Published in: Computer Speech & Language, Vol. 48, pp. 51–66
Main Authors: , , ,
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.03.2018
Summary:
• An automatic speech recognizer using deep neural networks is proposed as a model to predict speech intelligibility (SI).
• The DNN-based model predicts SI in normal-hearing listeners more accurately than four established SI models.
• In contrast to the baseline models, the proposed model predicts intelligibility from the noisy speech signal and does not require separated noise and speech inputs.
• A relevance propagation algorithm shows that DNNs can listen in the dips of modulated maskers.
An accurate objective prediction of human speech intelligibility is of interest for many applications, such as the evaluation of signal processing algorithms. To predict the speech recognition threshold (SRT) of normal-hearing listeners, an automatic speech recognition (ASR) system is employed that uses a deep neural network (DNN) to convert the acoustic input into phoneme predictions, which are subsequently decoded into word transcripts. ASR results are obtained for, and compared to, the data presented in Schubotz et al. (2016), which comprise eight different additive maskers, ranging from speech-shaped stationary noise to a single-talker interferer, and responses from eight normal-hearing subjects. The task for listeners and ASR is to identify noisy words from a German matrix sentence test in monaural conditions. Two ASR training schemes typically used in applications are considered: (A) matched training, which uses the same noise type for training and testing, and (B) multi-condition training, which covers all eight maskers. For both training schemes, ASR-based predictions outperform established measures such as the extended speech intelligibility index (ESII), the multi-resolution speech envelope power spectrum model (mr-sEPSM), and others. This result is obtained with a speaker-independent model that compares the word labels of the utterance with the ASR transcript and does not require separate noise and speech signals. The best predictions are obtained for multi-condition training with amplitude modulation features, which implies that the noise type has been seen during training. Predictions and measurements are analyzed by comparing speech recognition thresholds and individual psychometric functions to the DNN-based results.
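The evaluation described in the abstract rests on two simple computations: scoring a transcript against the known word labels of a matrix sentence, and reading off the SRT as the signal-to-noise ratio at which 50% of the words are recognized. A minimal illustrative sketch of both steps is given below; the function names and the interpolation-based threshold estimate are assumptions for illustration, not the paper's actual implementation (which fits full psychometric functions):

```python
def word_recognition_rate(labels, transcript):
    """Fraction of reference words reproduced at the same sentence position,
    as in matrix-test scoring (position-wise comparison, no alignment)."""
    return sum(ref == hyp for ref, hyp in zip(labels, transcript)) / len(labels)

def srt_from_scores(snrs_db, scores, target=0.5):
    """Estimate the SRT: linearly interpolate the SNR (in dB) at which
    the word recognition score crosses `target` (50% by default)."""
    points = sorted(zip(snrs_db, scores))
    for (s0, p0), (s1, p1) in zip(points, points[1:]):
        if p0 <= target <= p1:
            return s0 + (target - p0) * (s1 - s0) / (p1 - p0)
    raise ValueError("target score is not bracketed by the measured data")

# Example: scores measured at four SNRs; the 50% point lies between -5 and 0 dB.
rate = word_recognition_rate(["Peter", "kauft", "drei", "nasse", "Messer"],
                             ["Peter", "kauft", "zwei", "nasse", "Messer"])
srt = srt_from_scores([-10, -5, 0, 5], [0.1, 0.3, 0.6, 0.9])
```

In the paper, the same scoring is applied to both human responses and ASR transcripts, so model and listener SRTs are directly comparable.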
ISSN: 0885-2308, 1095-8363
DOI: 10.1016/j.csl.2017.10.004