Convolutional Maximum-Likelihood Distortionless Response Beamforming With Steering Vector Estimation for Robust Speech Recognition

Bibliographic Details
Published in IEEE/ACM transactions on audio, speech, and language processing Vol. 29; pp. 1352 - 1367
Main Authors Cho, Byung Joon; Park, Hyung-Min
Format Journal Article
Language English
Published Piscataway IEEE 2021
The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Summary: Beamforming has been one of the most successful multi-microphone approaches to robust speech recognition. The recently presented maximum-likelihood distortionless response (MLDR) beamformer achieves promising performance but, like many beamformers, requires an accurate steering vector for the target speaker in advance. In this paper, we present a method for steering vector estimation (SVE) that replaces the noise spatial covariance matrix estimate with a normalized version of the variance-weighted spatial covariance matrix estimate for the observed noisy speech signal, obtained by the iterative update rule in the MLDR beamforming framework. In addition, we present an MLDR beamforming method that needs no steering vector given in advance, in which the SVE and the beamforming are alternately repeated. Furthermore, an online algorithm based on recursive least squares (RLS) is derived to cope with various practical applications, including time-varying situations, and the power method is introduced for more efficient online processing. We also present batch and online convolutional MLDR beamforming with SVE for simultaneous beamforming and dereverberation, in which weighted prediction error (WPE) dereverberation and MLDR beamforming with SVE are jointly optimized based on maximum-likelihood estimation (MLE) for a zero-mean complex Gaussian signal with time-varying variances. Moreover, input signals masked by a neural network (NN) that estimates target speech or noise components can be used to further improve the presented beamformers. Experimental results on the CHiME-4 and REVERB challenge datasets demonstrate the effectiveness of the presented methods.
ISSN: 2329-9290
2329-9304
DOI: 10.1109/TASLP.2021.3067202
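
Note on the power method: the summary mentions the power method as a way to make online steering vector estimation more efficient, but it does not say which matrix it is applied to or how the result is normalized. The sketch below only illustrates the generic technique, power iteration for the dominant eigenvector of a Hermitian covariance-like matrix. The function name `power_method`, the toy matrix, and the normalization to the first (reference) microphone are assumptions made for illustration and are not taken from the paper.

```python
import math


def power_method(matrix, num_iters=100, tol=1e-12):
    """Estimate the dominant eigenpair of a Hermitian PSD matrix.

    `matrix` is a square list of lists of complex numbers (hypothetical
    stand-in for a spatial covariance estimate at one frequency bin).
    """
    n = len(matrix)
    vec = [complex(1.0, 0.0)] * n  # arbitrary nonzero start vector
    eigval = 0.0
    for _ in range(num_iters):
        # One power step: multiply by the matrix, then renormalize.
        new_vec = [sum(matrix[i][j] * vec[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(abs(v) ** 2 for v in new_vec))
        if norm == 0.0:
            raise ValueError("start vector lies in the null space of the matrix")
        new_vec = [v / norm for v in new_vec]
        # Rayleigh quotient of the unit-norm vector gives the eigenvalue estimate.
        mv = [sum(matrix[i][j] * new_vec[j] for j in range(n)) for i in range(n)]
        new_eigval = sum(new_vec[i].conjugate() * mv[i] for i in range(n)).real
        converged = abs(new_eigval - eigval) <= tol * max(1.0, abs(new_eigval))
        vec, eigval = new_vec, new_eigval
        if converged:
            break
    return eigval, vec


# Toy example (assumed values): rank-1 "target" part plus a small diagonal term.
true_h = [1 + 0j, 0.5j, -0.5 + 0j]
cov = [
    [2.0 * true_h[i] * true_h[j].conjugate() + (0.1 if i == j else 0.0)
     for j in range(3)]
    for i in range(3)
]

eigval, eigvec = power_method(cov)
# Assumed convention: scale so the reference (first) microphone entry is 1.
steering = [v / eigvec[0] for v in eigvec]
```

In this toy case the dominant eigenvector is proportional to `true_h`, so `steering` recovers it up to numerical error. Each iteration costs one matrix-vector product, which is the usual reason to prefer power iteration over a full eigendecomposition in frame-by-frame processing.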