Cross-Modal Quantization for Co-Speech Gesture Generation

Bibliographic Details
Published in: IEEE Transactions on Multimedia, Vol. 26, pp. 10251-10263
Main Authors: Wang, Zheng; Zhang, Wei; Ye, Long; Zeng, Dan; Mei, Tao
Format: Journal Article
Language: English
Published: IEEE, 2024
Summary: Learning proper representations for speech and gesture is essential for co-speech gesture generation. Existing approaches either use raw representations or encode speech and gesture independently, neglecting the joint representation that highlights the interplay between the two modalities. In this work, we propose a novel Cross-modal Quantization (CMQ) to jointly learn quantized codes for speech and gesture. Such a representation captures the speech-gesture interaction before the complex mapping is learned, and thus better suits the intricate relationship between speech and gesture. Specifically, the Cross-modal Quantizer jointly encodes speech and gesture into discrete codebooks, enabling richer cross-modal interaction. The Cross-modal Predictor then uses the learned codebooks to autoregressively predict the next-step gesture. With cross-modal quantization, our approach yields much higher codebook usage and generates more realistic and diverse gestures in practice. Extensive experiments on both 3D and 2D datasets, as well as a subjective user study, demonstrate a clear performance gain over several baseline models in terms of audio-visual alignment and gesture diversity. In particular, our method achieves a threefold improvement in diversity compared to baseline models while maintaining high motion fidelity.
ISSN: 1520-9210, 1941-0077
DOI: 10.1109/TMM.2024.3405743
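
Illustrative sketch: the summary describes a two-stage pipeline (a Cross-modal Quantizer that jointly learns discrete codebooks for speech and gesture, followed by a Cross-modal Predictor that autoregressively predicts next-step gesture codes). Below is a minimal PyTorch-style sketch of that general idea, assuming a standard VQ-VAE-style quantizer with a straight-through estimator. All class names, feature dimensions, and codebook sizes are hypothetical placeholders, not the authors' implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F


class VectorQuantizer(nn.Module):
    """Nearest-neighbour quantization with a straight-through estimator."""
    def __init__(self, num_codes=512, dim=128, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)
        self.beta = beta

    def forward(self, z):                        # z: (B, T, dim)
        flat = z.reshape(-1, z.size(-1))
        dist = torch.cdist(flat, self.codebook.weight)   # (B*T, num_codes)
        idx = dist.argmin(dim=-1)
        z_q = self.codebook(idx).view_as(z)
        # Standard VQ-VAE objective: codebook loss + commitment loss.
        loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()             # straight-through gradient
        return z_q, idx.view(z.shape[:-1]), loss


class CrossModalQuantizer(nn.Module):
    """Stage 1 (sketch): jointly quantizes speech and gesture so the two
    codebooks are learned with cross-modal interaction in view."""
    def __init__(self, speech_dim=80, pose_dim=42, dim=128, num_codes=512):
        super().__init__()
        self.speech_enc = nn.GRU(speech_dim, dim, batch_first=True)
        self.gesture_enc = nn.GRU(pose_dim, dim, batch_first=True)
        self.speech_vq = VectorQuantizer(num_codes, dim)
        self.gesture_vq = VectorQuantizer(num_codes, dim)
        self.gesture_dec = nn.GRU(2 * dim, dim, batch_first=True)
        self.to_pose = nn.Linear(dim, pose_dim)

    def forward(self, speech, gesture):
        s, _ = self.speech_enc(speech)           # (B, T, dim)
        g, _ = self.gesture_enc(gesture)         # (B, T, dim)
        s_q, s_idx, s_loss = self.speech_vq(s)
        g_q, g_idx, g_loss = self.gesture_vq(g)
        # Reconstruct gesture from both quantized streams, so the codes
        # are shaped by the speech-gesture interplay rather than learned
        # independently per modality.
        h, _ = self.gesture_dec(torch.cat([s_q, g_q], dim=-1))
        recon = self.to_pose(h)
        loss = F.mse_loss(recon, gesture) + s_loss + g_loss
        return recon, (s_idx, g_idx), loss


class CrossModalPredictor(nn.Module):
    """Stage 2 (sketch): autoregressively predicts next-step gesture codes
    from speech codes and previously generated gesture codes."""
    def __init__(self, num_codes=512, dim=128):
        super().__init__()
        self.speech_emb = nn.Embedding(num_codes, dim)
        self.gesture_emb = nn.Embedding(num_codes, dim)
        self.rnn = nn.GRU(2 * dim, dim, batch_first=True)
        self.head = nn.Linear(dim, num_codes)

    def forward(self, speech_idx, prev_gesture_idx):
        x = torch.cat([self.speech_emb(speech_idx),
                       self.gesture_emb(prev_gesture_idx)], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)                      # logits over next gesture codes


# Toy usage with made-up shapes: batch of 2 clips, 30 frames each.
model = CrossModalQuantizer()
speech = torch.randn(2, 30, 80)    # e.g. mel-spectrogram frames
gesture = torch.randn(2, 30, 42)   # e.g. 14 joints x 3D coordinates
recon, (speech_codes, gesture_codes), vq_loss = model(speech, gesture)

# In training, the previous-gesture codes would be shifted by one step;
# that detail is omitted here for brevity.
logits = CrossModalPredictor()(speech_codes, gesture_codes)   # (2, 30, 512)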