Robust multi-modal person identification with tolerance of facial expression

Bibliographic Details
Published in: 2004 IEEE International Conference on Systems, Man and Cybernetics, Vol. 1, pp. 580-585
Main Authors: Fox, N.A., Reilly, R.B.
Format: Conference Proceeding
Language: English
Published: Piscataway, NJ: IEEE, 2004
Summary: This work describes audio-visual speaker identification experiments carried out on a large data set of 251 subjects. Both the audio and visual modalities are modeled with hidden Markov models; the visual modality uses the speaker's lip information. Both modalities are degraded to emulate a train/test mismatch. The fusion method adapts automatically, using classifier-score reliability estimates from both modalities, and yields higher audio-visual accuracy than either modality alone at every tested level of audio and visual degradation. A maximum visual identification accuracy of 86% was achieved. This result is comparable to the performance of systems that use the entire face, and supports the hypothesis that the described system would be tolerant to varying facial expression, since only the information around the speaker's lips is employed.
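The abstract only sketches the fusion rule, so the snippet below is a minimal, assumed illustration of reliability-weighted late fusion rather than the authors' actual implementation: per-speaker HMM scores from each modality are mapped to pseudo-posteriors, a simple dispersion-based reliability measure is computed for each modality (both the softmax mapping and this measure are assumptions), and the final decision uses a convex combination weighted by those reliabilities.

```python
import numpy as np

def softmax(log_likelihoods):
    """Map per-speaker log-likelihoods to pseudo-posteriors (assumed step)."""
    e = np.exp(log_likelihoods - np.max(log_likelihoods))
    return e / e.sum()

def reliability(posteriors):
    """Hypothetical reliability estimate: gap between the best posterior and
    the mean of the rest. A flat, ambiguous score vector yields a value near 0."""
    p = np.sort(posteriors)[::-1]
    return float(p[0] - p[1:].mean())

def fuse(audio_loglik, visual_loglik):
    """Reliability-weighted late fusion of two modality score vectors."""
    p_a, p_v = softmax(audio_loglik), softmax(visual_loglik)
    r_a, r_v = reliability(p_a), reliability(p_v)
    alpha = r_a / (r_a + r_v + 1e-12)          # audio weight in [0, 1]
    fused = alpha * p_a + (1.0 - alpha) * p_v
    return int(np.argmax(fused)), alpha

# Toy example (made-up numbers): the audio scores are discriminative while the
# visual scores are nearly flat, so the fusion weight leans towards audio.
audio = np.array([-110.0, -95.0, -118.0, -120.0])
visual = np.array([-60.0, -60.5, -59.8, -60.2])
speaker, alpha = fuse(audio, visual)
print(f"identified speaker index: {speaker}, audio weight: {alpha:.2f}")
```

Under degradation of one modality, its scores flatten, its reliability estimate drops, and the weight shifts towards the cleaner modality, which is the behaviour the abstract attributes to the adaptive fusion scheme.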
ISBN: 0780385667, 9780780385665
ISSN: 1062-922X
DOI: 10.1109/ICSMC.2004.1398362