DBN Based Models for Audio-Visual Speech Analysis and Recognition

Bibliographic Details
Published in	Advances in Multimedia Information Processing - PCM 2006, pp. 19-30
Main Authors	Ravyse, Ilse; Jiang, Dongmei; Jiang, Xiaoyue; Lv, Guoyun; Hou, Yunshu; Sahli, Hichem; Zhao, Rongchun
Format	Book Chapter Conference Proceeding
Language	English
Published	Berlin, Heidelberg: Springer Berlin Heidelberg, 2006
Series	Lecture Notes in Computer Science
Subjects	Acoustics | Applied sciences | Artificial intelligence | Automatic Speech Recognition | Computer science; control theory; systems | Computer systems and distributed systems. User interface | Dynamic Bayesian Network | Exact sciences and technology | Fundamental areas of phenomenology (including applications) | Physics | Software | Speech and sound recognition and synthesis. Linguistics | Speech Recognition | Speech Recognition System | Transduction; acoustical devices for the generation and reproduction of sound | Visual Speech | Speech analysis | Image recognition | Segmentation | Video signal | Modeling | Noise level | Audiovisual | Audio acoustics | Bayes network | Audiovisual equipment | Dynamic model | Pattern extraction | Speaker | Multimedia | Streaming | Computer vision | Head | Distributed system | Spectral analysis | Persistence | Lip | Sound analysis | Cepstrum | Speech recognition | Hidden Markov model | Automatic recognition
Summary:	We present an audio-visual automatic speech recognition system that significantly improves speech recognition performance over a wide range of acoustic noise levels, as well as under clean audio conditions. The system consists of three components: (i) a visual module, (ii) an acoustic module, and (iii) a Dynamic Bayesian Network-based recognition module. The visual module locates and tracks the speaker's head and mouth movements and extracts relevant speech features, represented by contour information and 3D deformations of lip movements. The acoustic module extracts noise-robust features, i.e. the Mel Filterbank Cepstrum Coefficients (MFCCs). Finally, we propose two models based on Dynamic Bayesian Networks (DBNs): one models the audio and video streams individually, and the other integrates the features from both streams. We also compare the proposed DBN-based system with a classical Hidden Markov Model. The novelty of the developed framework is the persistence of the audiovisual speech signal characteristics from the extraction step through the learning step. Experiments on continuous audiovisual speech show that the segmentation boundaries of phones in the audio stream and visemes in the video stream are close to manual segmentation boundaries.
ISBN:	3540487662 9783540487661
ISSN:	0302-9743 1611-3349
DOI:	10.1007/11922162_3