VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Bibliographic Details
Main Authors: Huang, Chi-Pin; Wu, Yen-Siang; Chung, Hung-Kai; Chang, Kai-Po; Yang, Fu-En; Wang, Yu-Chiang Frank
Format: Journal Article
Language: English
Published: 27.03.2025
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: Get full text (https://arxiv.org/abs/2503.21781)
DOI: 10.48550/arxiv.2503.21781

Abstract: Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
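The abstract mentions that VideoMage captures subject and motion concepts with LoRAs (low-rank adapters) attached to a frozen text-to-video diffusion model. The sketch below illustrates the general LoRA mechanism only; the class name, dimensions, and initialization are illustrative assumptions and not taken from the paper's implementation.

```python
import numpy as np

# Minimal sketch of a LoRA-adapted linear layer. In LoRA, the base weight W
# stays frozen, and a trainable low-rank update B @ A is added on top; the
# up-projection B is zero-initialized so training starts from the unmodified
# base model. All names and sizes here are assumptions for illustration.
class LoRALinear:
    def __init__(self, d_in, d_out, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01            # trainable down-projection
        self.B = np.zeros((d_out, rank))                             # trainable up-projection (zero init)
        self.scale = alpha / rank

    def __call__(self, x):
        # Frozen path plus scaled low-rank update. At initialization B = 0,
        # so the adapter is a no-op and the base model's output is preserved.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=64, d_out=64)
x = np.ones((1, 64))
base = x @ layer.W.T
assert np.allclose(layer(x), base)  # zero-initialized adapter changes nothing
```

Training one such adapter per subject (from images) and per motion (from videos), while the backbone stays frozen, is what lets separately learned concepts be composed at generation time.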
Copyright: http://creativecommons.org/licenses/by/4.0