VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multipl...

Full description

Saved in:

Bibliographic Details
Main Authors	Huang, Chi-Pin, Wu, Yen-Siang, Chung, Hung-Kai, Chang, Kai-Po, Yang, Fu-En, Wang, Yu-Chiang Frank
Format	Journal Article
Language	English
Published	27.03.2025
Subjects	Computer Science - Computer Vision and Pattern Recognition
Online Access	Get full text

Cover

Loading…

More Information
Summary:	Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
DOI:	10.48550/arxiv.2503.21781