Expression-Preserving Face Frontalization Improves Visually Assisted Speech Processing

Bibliographic Details
Published in	International journal of computer vision Vol. 131; no. 5; pp. 1122 - 1140
Main Authors	Kang, Zhiqi; Sadeghi, Mostafa; Horaud, Radu; Alameda-Pineda, Xavier
Format	Journal Article
Language	English
Published	New York Springer US 01.05.2023 Springer Springer Nature B.V Springer Verlag
Subjects	Artificial Intelligence; Communication; Computer Imaging; Computer Science; Computer Vision and Pattern Recognition; Cross correlation; Head movement; Image Processing and Computer Vision; Intelligibility; Lipreading; Machine Learning; Methods; Pattern Recognition; Pattern Recognition and Graphics; Registration; Sound; Special Issue on Traditional Computer Vision in the Age of Deep Learning; Speech; Speech processing; Speech recognition; Statistical inference; Vision; Voice recognition; Variational auto-encoders; Student’s t-distribution; Face frontalization; Robust point registration; Bayesian filtering; Lip reading; Audio-visual speech enhancement

Summary:	Face frontalization consists of synthesizing a frontal view from a profile one. This paper proposes a frontalization method that preserves non-rigid facial deformations, i.e. facial expressions. It is shown that expression-preserving frontalization boosts the performance of visually assisted speech processing. The method alternates between the estimation of (i) the rigid transformation (scale, rotation, and translation) and (ii) the non-rigid deformation between an arbitrarily viewed face and a face model. The method has two important merits: it can deal with non-Gaussian errors in the data and it incorporates a dynamical face deformation model. For that purpose, we use the Student’s t-distribution in combination with a Bayesian filter in order to account for both rigid head motions and time-varying facial deformations, e.g. those caused by speech production. The zero-mean normalized cross-correlation score is used to evaluate the ability of the method to preserve facial expressions. The method is thoroughly evaluated and compared with several state-of-the-art methods, based either on traditional geometric models or on deep learning. Moreover, we show that the method, when incorporated into speech processing pipelines, improves word recognition rates and speech intelligibility scores by a considerable margin.
ISSN:	0920-5691 1573-1405
DOI:	10.1007/s11263-022-01742-1
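
The summary names the zero-mean normalized cross-correlation (ZNCC) score as the measure of how well facial expressions are preserved. The record does not say which image regions the authors compare or how scores are aggregated, so the following is only a minimal sketch of the standard ZNCC definition, assuming two equally sized grayscale patches held as NumPy arrays. The function name `zncc` and the `eps` guard are illustrative choices, not taken from the paper.

```python
import numpy as np


def zncc(a, b, eps=1e-12):
    """Zero-mean normalized cross-correlation of two equally sized arrays.

    Returns a value in [-1, 1]. Assumption of this sketch: if either input
    has (numerically) zero variance, the score is defined as 0.0.
    """
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    if a.shape != b.shape:
        raise ValueError("inputs must contain the same number of elements")

    # Subtract the mean of each patch (the "zero-mean" part).
    a = a - a.mean()
    b = b - b.mean()

    # Divide by the product of the norms (the "normalized" part).
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom < eps:
        return 0.0
    return float(np.dot(a, b) / denom)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    patch = rng.random((32, 32))
    # A positive gain and an offset leave the score unchanged.
    print(zncc(patch, 2.0 * patch + 5.0))
    # An inverted patch is perfectly anti-correlated.
    print(zncc(patch, -patch))
```

Because each patch is centered and scaled before comparison, the score is insensitive to a positive gain and a constant offset in brightness, which is one common reason for preferring ZNCC over a raw pixel difference when comparing a synthesized view with a reference view.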