Multi-channel speech processing architectures for noise robust speech recognition: 3rd CHiME challenge results

Bibliographic Details
Published in 2015 IEEE Workshop on Automatic Speech Recognition and Understanding (ASRU), pp. 452-459
Main Authors Pfeifenberger, Lukas; Schrank, Tobias; Zohrer, Matthias; Hagmuller, Martin; Pernkopf, Franz
Format Conference Proceeding
Language English
Published IEEE 01.12.2015
Summary: Recognizing speech under noisy conditions is an ill-posed problem. The CHiME 3 challenge targets robust speech recognition in realistic environments such as street, bus, café and pedestrian areas. We study variants of beamformers used for pre-processing multi-channel speech recordings. In particular, we investigate three variants of generalized side-lobe canceller (GSC) beamformers, i.e., GSC with sparse blocking matrix (BM), GSC with adaptive BM (ABM), and GSC with minimum variance distortionless response (MVDR) and ABM. Furthermore, we apply several post-filters to further enhance the speech signal. We introduce MaxPower postfilters and deep neural postfilters (DPFs). DPFs significantly outperformed our baseline systems as measured by the overall perceptual score (OPS) and the perceptual evaluation of speech quality (PESQ). In particular, DPFs achieved an average relative improvement of 17.54% OPS points and 18.28% in PESQ compared to the CHiME 3 baseline. DPFs also achieved the best WER when combined with an ASR engine on simulated development and evaluation data, i.e., 8.98% and 10.82% WER, respectively. The proposed MaxPower beamformer achieved the best overall WER on CHiME 3 real development and evaluation data, i.e., 14.23% and 22.12%, respectively.
DOI: 10.1109/ASRU.2015.7404830