VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models
Format | Journal Article |
Language | English |
Published | 27.03.2025 |
Summary | Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose VideoMage, a unified framework for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions. |
DOI | 10.48550/arxiv.2503.21781 |
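The record above gives only the abstract; implementation details are in the paper itself. As a minimal, illustrative sketch of the LoRA mechanism the abstract builds on, the PyTorch snippet below attaches two independent low-rank adapters, one standing in for a subject LoRA and one for a motion LoRA, to a single frozen projection layer and sums their residuals. The class name `LoRALinear`, the chosen ranks, and the simple additive composition are assumptions for illustration only; they are not VideoMage's actual spatial-temporal composition scheme.

```python
import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """Frozen linear layer plus multiple low-rank residual adapters.

    Each adapter contributes scale * x @ A^T @ B^T, so several adapters
    (e.g., one per subject plus one for motion) can share one base weight.
    This is a generic LoRA sketch, not VideoMage's composition scheme.
    """

    def __init__(self, base: nn.Linear, ranks=(4, 4), scale=1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # pretrained backbone weights stay frozen
        # Down-projections (A) start small, up-projections (B) start at zero,
        # so each adapter initially leaves the base output unchanged.
        self.downs = nn.ParameterList(
            [nn.Parameter(torch.randn(r, base.in_features) * 0.01) for r in ranks]
        )
        self.ups = nn.ParameterList(
            [nn.Parameter(torch.zeros(base.out_features, r)) for r in ranks]
        )
        self.scale = scale

    def forward(self, x):
        out = self.base(x)
        for A, B in zip(self.downs, self.ups):
            out = out + self.scale * (x @ A.t() @ B.t())
        return out


# Toy usage: attach a "subject" adapter and a "motion" adapter to one
# attention projection of a stand-in video diffusion backbone.
proj = nn.Linear(320, 320)
lora_proj = LoRALinear(proj, ranks=(4, 8))  # index 0: subject, index 1: motion
tokens = torch.randn(2, 16, 320)            # (batch, tokens, feature dim)
print(lora_proj(tokens).shape)              # torch.Size([2, 16, 320])
```

In a real setup, adapters like these would be attached throughout the attention layers of the text-to-video backbone, with each adapter trained separately on its subject images or motion reference video before being combined at inference time.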