Disentangling Foreground and Background Motion for Enhanced Realism in Human Video Generation
Main Authors | |
---|---|
Format | Journal Article |
Language | English |
Published | 25.05.2024 |
Subjects | |
Online Access | Get full text |
Summary: | Recent advances in human video synthesis enable the generation of
high-quality videos with stable diffusion models. However, existing methods
concentrate on animating only the human subject (the foreground) guided by
pose information, while leaving the background entirely static. In authentic,
high-quality videos, by contrast, the background often moves in harmony with
the foreground rather than remaining still. We introduce a technique that
jointly learns foreground and background dynamics by separating their
movements into distinct motion representations: human figures are animated
with pose-based motion, which captures intricate actions, while background
motion is modeled with sparse tracking points, reflecting the natural
interaction between foreground activity and environmental change. Trained on
real-world videos augmented with this motion representation, our model
generates videos with coherent movement in both the foreground subjects and
their surroundings. To extend generation to longer sequences without
accumulating errors, we adopt a clip-by-clip strategy and introduce global
features at each step. To ensure seamless continuity across clips, we combine
the final frame of a generated clip with the input noise that seeds the next
one. Throughout this sequential process, we also inject the feature
representation of the initial reference image into the network, curbing the
cumulative color drift that would otherwise arise. Empirical evaluations show
that our method produces videos with harmonious interplay between foreground
actions and responsive background dynamics, surpassing prior approaches in
this regard. |
---|---|
DOI: | 10.48550/arxiv.2405.16393 |
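
The summary describes two mechanisms concretely enough to sketch: a dual motion representation (dense pose for the foreground, sparse tracking points for the background) used as conditioning, and clip-by-clip generation that links each clip to the previous one through its final frame while injecting reference-image features. Below is a minimal, hypothetical Python/NumPy sketch of that data flow, not the paper's implementation: all names (`render_pose_maps`, `render_track_maps`, `denoise_clip`, `generate_long_video`), shapes, and the `blend` weight are assumptions, and the denoiser is a stub standing in for a stable-diffusion backbone.

```python
# Hypothetical sketch (not from the paper): fuse dense pose maps (foreground)
# with sparse tracking-point maps (background) into one conditioning tensor,
# and chain clips by blending the previous clip's last frame into the next
# clip's initial noise while reusing a reference-image feature throughout.
# All names, shapes, and the blending weight are illustrative assumptions.
import numpy as np

H, W, T = 64, 64, 8          # frame size and frames per clip (assumed)


def render_pose_maps(joints_uv: np.ndarray) -> np.ndarray:
    """Rasterize per-frame 2D joints (T, J, 2) into dense pose maps (T, H, W)."""
    maps = np.zeros((joints_uv.shape[0], H, W), dtype=np.float32)
    for t, frame in enumerate(joints_uv):
        for u, v in frame:
            maps[t, int(v * (H - 1)), int(u * (W - 1))] = 1.0
    return maps


def render_track_maps(tracks_uv: np.ndarray) -> np.ndarray:
    """Rasterize sparse background tracking points (T, P, 2) into maps (T, H, W)."""
    maps = np.zeros((tracks_uv.shape[0], H, W), dtype=np.float32)
    for t, frame in enumerate(tracks_uv):
        for u, v in frame:
            maps[t, int(v * (H - 1)), int(u * (W - 1))] = 1.0
    return maps


def build_condition(joints_uv, tracks_uv):
    """Stack the two motion representations channel-wise: (T, 2, H, W)."""
    return np.stack([render_pose_maps(joints_uv), render_track_maps(tracks_uv)], axis=1)


def denoise_clip(init_noise, condition, ref_feature, rng):
    """Stand-in for the diffusion denoiser: returns a toy clip (T, H, W, 3).

    A real model would iteratively denoise `init_noise` while attending to
    `condition` (motion) and `ref_feature` (appearance of the reference image).
    """
    frames = rng.random((condition.shape[0], H, W, 3)).astype(np.float32)
    return 0.5 * frames + 0.5 * ref_feature  # keep colours tied to the reference


def generate_long_video(num_clips, joints_uv, tracks_uv, ref_image, blend=0.3, seed=0):
    """Clip-by-clip generation: seed each clip's noise with the previous last frame."""
    rng = np.random.default_rng(seed)
    ref_feature = ref_image.mean(axis=(0, 1))          # toy "global" reference feature
    condition = build_condition(joints_uv, tracks_uv)
    last_frame, video = None, []
    for _ in range(num_clips):
        noise = rng.standard_normal((T, H, W, 3)).astype(np.float32)
        if last_frame is not None:
            noise = (1 - blend) * noise + blend * last_frame  # link clips via last frame
        clip = denoise_clip(noise, condition, ref_feature, rng)
        last_frame = clip[-1]
        video.append(clip)
    return np.concatenate(video, axis=0)


if __name__ == "__main__":
    rng = np.random.default_rng(1)
    joints = rng.random((T, 17, 2))      # assumed 17 body joints per frame
    tracks = rng.random((T, 32, 2))      # assumed 32 sparse background tracks
    ref = rng.random((H, W, 3)).astype(np.float32)
    out = generate_long_video(num_clips=3, joints_uv=joints, tracks_uv=tracks, ref_image=ref)
    print(out.shape)                     # (24, 64, 64, 3): 3 clips of 8 frames
```

In a real system the stub denoiser would be a video diffusion backbone and the reference feature would come from an image encoder; the sketch only shows how the two motion maps, the last-frame/noise link, and the persistent reference feature fit together.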