Mimic: Speaking Style Disentanglement for Speech-Driven 3D Facial Animation
Main Authors | , , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 17.12.2023 |
Subjects | |
DOI | 10.48550/arxiv.2312.10877 |
Summary: | Speech-driven 3D facial animation aims to synthesize vivid facial animations that accurately synchronize with speech and match the unique speaking style. However, existing works primarily focus on achieving precise lip synchronization while neglecting to model the subject-specific speaking style, often resulting in unrealistic facial animations. To the best of our knowledge, this work makes the first attempt to explore the coupled information between the speaking style and the semantic content in facial motions. Specifically, we introduce an innovative speaking style disentanglement method, which enables arbitrary-subject speaking style encoding and leads to a more realistic synthesis of speech-driven facial animations. Subsequently, we propose a novel framework called **Mimic** to learn disentangled representations of the speaking style and content from facial motions by building two latent spaces for style and content, respectively. Moreover, to facilitate disentangled representation learning, we introduce four well-designed constraints: an auxiliary style classifier, an auxiliary inverse classifier, a content contrastive loss, and a pair of latent cycle losses, which can effectively contribute to the construction of the identity-related style space and semantic-related content space. Extensive qualitative and quantitative experiments conducted on three publicly available datasets demonstrate that our approach outperforms state-of-the-art methods and is capable of capturing diverse speaking styles for speech-driven 3D facial animation. The source code and supplementary video are publicly available at: https://zeqing-wang.github.io/Mimic/ |
---|---|
DOI: | 10.48550/arxiv.2312.10877 |
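To make the abstract's description of the framework more concrete, below is a minimal, hypothetical PyTorch sketch of a style/content disentanglement setup with the four constraints it names. All module names, dimensions, and the concrete form of each loss (a gradient-reversal layer for the inverse classifier, an InfoNCE-style term for the content contrastive loss, and re-encoding consistency for the latent cycle losses) are assumptions for illustration, not the paper's released implementation.

```python
# Hypothetical sketch of style/content disentanglement for facial-motion codes.
# Names, dimensions, and loss formulations are illustrative assumptions,
# not the Mimic authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GradReverse(torch.autograd.Function):
    """Identity on the forward pass, negated gradient on the backward pass."""
    @staticmethod
    def forward(ctx, x):
        return x.view_as(x)

    @staticmethod
    def backward(ctx, grad_output):
        return -grad_output


def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))


class DisentangleModel(nn.Module):
    def __init__(self, motion_dim=64, style_dim=32, content_dim=32, num_subjects=8):
        super().__init__()
        self.style_enc = mlp(motion_dim, style_dim)       # identity-related style space
        self.content_enc = mlp(motion_dim, content_dim)   # semantic-related content space
        self.decoder = mlp(style_dim + content_dim, motion_dim)
        self.style_cls = nn.Linear(style_dim, num_subjects)      # auxiliary style classifier
        self.inverse_cls = nn.Linear(content_dim, num_subjects)  # auxiliary inverse classifier

    def forward(self, motion):
        s = self.style_enc(motion)
        c = self.content_enc(motion)
        recon = self.decoder(torch.cat([s, c], dim=-1))
        return s, c, recon


def info_nce(z_a, z_b, temperature=0.1):
    """Contrastive loss: matching items across the two views are positives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return F.cross_entropy(logits, targets)


def training_step(model, motion_a, motion_b, subject_ids):
    """motion_a / motion_b: two clips sharing content, shape (batch, motion_dim)."""
    s_a, c_a, recon_a = model(motion_a)
    _, c_b, _ = model(motion_b)

    loss_recon = F.l1_loss(recon_a, motion_a)
    # Style classifier: the style code should predict subject identity.
    loss_style = F.cross_entropy(model.style_cls(s_a), subject_ids)
    # Inverse classifier: gradient reversal pushes the content code
    # toward being uninformative about identity.
    loss_inv = F.cross_entropy(model.inverse_cls(GradReverse.apply(c_a)), subject_ids)
    # Content contrastive loss: content codes of matching clips attract each other.
    loss_contrast = info_nce(c_a, c_b)
    # Latent cycle losses: re-encode the reconstruction and match both codes.
    s_cyc, c_cyc, _ = model(recon_a)
    loss_cycle = F.l1_loss(s_cyc, s_a.detach()) + F.l1_loss(c_cyc, c_a.detach())

    return loss_recon + loss_style + loss_inv + loss_contrast + loss_cycle


if __name__ == "__main__":
    model = DisentangleModel()
    motion_a, motion_b = torch.randn(4, 64), torch.randn(4, 64)
    subject_ids = torch.randint(0, 8, (4,))
    loss = training_step(model, motion_a, motion_b, subject_ids)
    loss.backward()
    print(float(loss))
```

Under this reading, the abstract's "arbitrary-subject speaking style encoding" would correspond to extracting a style code from any subject's motion at inference time and combining it with speech-driven content codes in the decoder; the exact conditioning used by the paper may differ.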