Improving Audio-Visual Speech Recognition Using Gabor Recurrent Neural Networks

Bibliographic Details
Published in	Multimodal Pattern Recognition of Social Signals in Human-Computer-Interaction, Vol. 11377, pp. 71-83
Main Authors	Saudi, Ali S., Khalil, Mahmoud I., Abbas, Hazem M.
Format	Book Chapter
Language	English
Published	Switzerland: Springer International Publishing AG, 2019
Series	Lecture Notes in Computer Science
Subjects	Audio-Visual Speech Recognition; Bidirectional Recurrent Neural Network; Gabor filters

More Information
Summary:	The performance of speech recognition systems can be significantly improved when visual information is used in conjunction with audio information, especially in noisy environments. Prompted by the great achievements of deep learning in solving Audio-Visual Speech Recognition (AVSR) problems, we propose a deep AVSR model based on a Long Short-Term Memory Bidirectional Recurrent Neural Network (LSTM-BRNN). The proposed deep AVSR model uses Gabor filters in both the audio and visual front-ends with an Early Integration (EI) scheme. This model is termed the BRNN$_{av}$ model. The Gabor features simulate the underlying spatiotemporal processing chain that occurs in the Primary Audio Cortex (PAC) in conjunction with the Primary Visual Cortex (PVC). We name these features Gabor Audio Features (GAF) and Gabor Visual Features (GVF). The experimental results show that the deep Gabor LSTM-BRNN-based model achieves superior performance compared to GMM-HMM-based models that use the same front-ends. Furthermore, the use of GAF and GVF in the audio and visual front-ends attains a significant improvement in performance compared to traditional audio and visual features.
ISBN:	3030209830 9783030209834
ISSN:	0302-9743 1611-3349
DOI:	10.1007/978-3-030-20984-1_7