Cross-Modal Quantization for Co-Speech Gesture Generation
Published in | IEEE Transactions on Multimedia, Vol. 26, pp. 10251–10263 |
---|---|
Format | Journal Article |
Language | English |
Published | IEEE, 2024 |
Summary | Learning proper representations for speech and gesture is essential for co-speech gesture generation. Existing approaches either use direct representations or encode speech and gesture independently, neglecting the joint representation that highlights the interplay between the two modalities. In this work, we propose a novel Cross-modal Quantization (CMQ) to jointly learn quantized codes for speech and gesture. Such a representation exposes speech-gesture interaction before the complex cross-modal mapping is learned, and thus better suits the intricate relationship between speech and gesture. Specifically, the Cross-modal Quantizer jointly encodes speech and gesture as discrete codebooks, enabling richer cross-modal interaction; the Cross-modal Predictor then uses the learned codebooks to autoregressively predict the next-step gesture. With cross-modal quantization, our approach achieves much higher codebook usage and generates more realistic and diverse gestures in practice. Extensive experiments on both 3D and 2D datasets, along with a subjective user study, demonstrate a clear performance gain over several baseline models in terms of audio-visual alignment and gesture diversity. In particular, our method achieves a three-fold improvement in diversity over baseline models while maintaining high motion fidelity. |
ISSN | 1520-9210; 1941-0077 |
DOI | 10.1109/TMM.2024.3405743 |
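
The pipeline described in the summary above (joint quantization of both modalities, followed by autoregressive prediction of gesture codes) can be illustrated with a minimal sketch. The code below is a VQ-VAE-style toy in PyTorch, not the paper's implementation: all class names, layer choices (linear encoders, a single fusion layer, a GRU predictor), codebook sizes, and feature dimensions are assumptions made for illustration, as the abstract does not specify these details.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


def quantize(z, codebook):
    """Nearest-neighbour codebook lookup with a straight-through gradient
    (standard VQ-VAE); returns quantized latents, code indices, VQ loss."""
    # z: (batch, time, dim); codebook: nn.Embedding(num_codes, dim)
    dists = torch.cdist(z, codebook.weight.unsqueeze(0).expand(z.size(0), -1, -1))
    codes = dists.argmin(dim=-1)                     # (batch, time)
    z_q = codebook(codes)                            # (batch, time, dim)
    # codebook loss + 0.25 * commitment loss, as in the original VQ-VAE
    loss = F.mse_loss(z_q, z.detach()) + 0.25 * F.mse_loss(z, z_q.detach())
    return z + (z_q - z).detach(), codes, loss


class CrossModalQuantizer(nn.Module):
    """Toy joint quantizer: the speech and gesture streams exchange
    information through a fusion layer before each is snapped to its own
    codebook, so the discrete codes already reflect cross-modal structure."""

    def __init__(self, speech_dim=128, gesture_dim=96, dim=64, num_codes=512):
        super().__init__()
        self.speech_enc = nn.Linear(speech_dim, dim)
        self.gesture_enc = nn.Linear(gesture_dim, dim)
        self.mix = nn.Linear(2 * dim, 2 * dim)       # simple cross-modal fusion
        self.speech_book = nn.Embedding(num_codes, dim)
        self.gesture_book = nn.Embedding(num_codes, dim)

    def forward(self, speech, gesture):
        s, g = self.speech_enc(speech), self.gesture_enc(gesture)
        # fuse, then split back into two modality-specific streams
        s, g = self.mix(torch.cat([s, g], dim=-1)).chunk(2, dim=-1)
        s_q, s_codes, s_loss = quantize(s, self.speech_book)
        g_q, g_codes, g_loss = quantize(g, self.gesture_book)
        return (s_q, g_q), (s_codes, g_codes), s_loss + g_loss


class CrossModalPredictor(nn.Module):
    """Autoregressive stub: condition on speech codes and previously
    generated gesture codes to predict logits for the next gesture code."""

    def __init__(self, num_codes=512, dim=64, hidden=256):
        super().__init__()
        self.s_emb = nn.Embedding(num_codes, dim)
        self.g_emb = nn.Embedding(num_codes, dim)
        self.rnn = nn.GRU(2 * dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_codes)

    def forward(self, speech_codes, prev_gesture_codes):
        x = torch.cat([self.s_emb(speech_codes),
                       self.g_emb(prev_gesture_codes)], dim=-1)
        h, _ = self.rnn(x)
        return self.head(h)                          # (batch, time, num_codes)


# Smoke test with random features (4 clips, 50 frames each).
quantizer = CrossModalQuantizer()
speech = torch.randn(4, 50, 128)
gesture = torch.randn(4, 50, 96)
(_, _), (s_codes, g_codes), vq_loss = quantizer(speech, gesture)
logits = CrossModalPredictor()(s_codes, g_codes)      # (4, 50, 512)
```

The straight-through trick (`z + (z_q - z).detach()`) lets gradients bypass the non-differentiable nearest-neighbour lookup, and fusing the two streams before quantization is what makes the codes "cross-modal"; plausibly this joint encoding is also what drives the higher codebook usage the summary reports, compared to codebooks trained independently per modality.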