VideoMage: Multi-Subject and Motion Customization of Text-to-Video Diffusion Models

Bibliographic Details
Main Authors: Huang, Chi-Pin; Wu, Yen-Siang; Chung, Hung-Kai; Chang, Kai-Po; Yang, Fu-En; Wang, Yu-Chiang Frank
Format: Journal Article
Language: English
Published: 27.03.2025
Subjects: Computer Science - Computer Vision and Pattern Recognition
Online Access: Get full text (https://arxiv.org/abs/2503.21781)
DOI: 10.48550/arxiv.2503.21781

Abstract: Customized text-to-video generation aims to produce high-quality videos that incorporate user-specified subject identities or motion patterns. However, existing methods mainly focus on personalizing a single concept, either subject identity or motion pattern, limiting their effectiveness for multiple subjects with the desired motion patterns. To tackle this challenge, we propose a unified framework VideoMage for video customization over both multiple subjects and their interactive motions. VideoMage employs subject and motion LoRAs to capture personalized content from user-provided images and videos, along with an appearance-agnostic motion learning approach to disentangle motion patterns from visual appearance. Furthermore, we develop a spatial-temporal composition scheme to guide interactions among subjects within the desired motion patterns. Extensive experiments demonstrate that VideoMage outperforms existing methods, generating coherent, user-controlled videos with consistent subject identities and interactions.
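The abstract mentions that VideoMage captures subject and motion concepts with LoRAs (low-rank adapters) attached to a frozen text-to-video diffusion model. The sketch below illustrates the general LoRA mechanism only; the class name, dimensions, and initialization are illustrative assumptions and not taken from the paper's implementation.

```python
import numpy as np

# Minimal sketch of a LoRA-adapted linear layer. In LoRA, the base weight W
# stays frozen, and a trainable low-rank update B @ A is added on top; the
# up-projection B is zero-initialized so training starts from the unmodified
# base model. All names and sizes here are assumptions for illustration.
class LoRALinear:
    def __init__(self, d_in, d_out, rank=4, alpha=1.0, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.standard_normal((d_out, d_in)) / np.sqrt(d_in)  # frozen base weight
        self.A = rng.standard_normal((rank, d_in)) * 0.01            # trainable down-projection
        self.B = np.zeros((d_out, rank))                             # trainable up-projection (zero init)
        self.scale = alpha / rank

    def __call__(self, x):
        # Frozen path plus scaled low-rank update. At initialization B = 0,
        # so the adapter is a no-op and the base model's output is preserved.
        return x @ self.W.T + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(d_in=64, d_out=64)
x = np.ones((1, 64))
base = x @ layer.W.T
assert np.allclose(layer(x), base)  # zero-initialized adapter changes nothing
```

Training one such adapter per subject (from images) and per motion (from videos), while the backbone stays frozen, is what lets separately learned concepts be composed at generation time.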
Copyright: http://creativecommons.org/licenses/by/4.0