Learning Torso Prior for Co-Speech Gesture Generation with Better Hand Shape


Bibliographic Details
Published in: 2023 IEEE International Conference on Image Processing (ICIP), pp. 1-5
Main Authors: Wang, Hexiang; Liu, Fengqi; Yi, Ran; Ma, Lizhuang
Format: Conference Proceeding
Language: English
Published: IEEE, 08.10.2023

Summary: Co-speech gesture generation is the task of synthesizing gesture sequences synchronized with an input audio signal. Previous methods estimate the upper-body gesture as a whole, ignoring the different mapping relations between audio and different body parts, which leads to poor overall results and especially bad hand shapes. In this paper, we propose a novel three-branch co-speech gesture generation framework to obtain better results. In particular, we propose a Torso2Hand Prior Learning (T2HPL) module that leverages torso information as an extra prior to enhance hand pose prediction, and we carefully design a hand shape discriminator to improve the authenticity of the generated hand shapes. In addition, an arm orientation loss is designed to encourage the network to generate a torso part with better semantic expressiveness. Experiments on a dataset of four different speakers demonstrate the superiority of our method over state-of-the-art approaches.
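The record does not specify how the arm orientation loss is formulated. A minimal sketch of one plausible formulation, penalizing the angular deviation between predicted and ground-truth arm direction vectors via cosine similarity (the function name, joint indexing, and tensor layout here are all hypothetical, not taken from the paper):

```python
import numpy as np

def arm_orientation_loss(pred_joints, gt_joints, shoulder_idx, elbow_idx):
    """Hypothetical arm orientation loss.

    Computes 1 - cosine similarity between the predicted and ground-truth
    shoulder->elbow direction vectors, averaged over the sequence.
    pred_joints, gt_joints: arrays of shape (T, J, 3) for T frames, J joints.
    Returns 0 for perfectly aligned arm directions, up to 2 for opposite ones.
    """
    # Direction vectors from shoulder to elbow for each frame: shape (T, 3)
    pred_dir = pred_joints[:, elbow_idx] - pred_joints[:, shoulder_idx]
    gt_dir = gt_joints[:, elbow_idx] - gt_joints[:, shoulder_idx]

    # Normalize to unit vectors; eps guards against zero-length bones
    eps = 1e-8
    pred_dir = pred_dir / (np.linalg.norm(pred_dir, axis=-1, keepdims=True) + eps)
    gt_dir = gt_dir / (np.linalg.norm(gt_dir, axis=-1, keepdims=True) + eps)

    # Per-frame cosine similarity, then mean angular penalty over the sequence
    cos_sim = np.sum(pred_dir * gt_dir, axis=-1)
    return float(np.mean(1.0 - cos_sim))
```

A loss of this shape supervises only the *direction* of the arm bones, not joint positions, which is one way such a term could push the generator toward semantically expressive arm poses without constraining limb lengths.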
DOI: 10.1109/ICIP49359.2023.10222259