DisenDreamer: Subject-Driven Text-to-Image Generation With Sample-Aware Disentangled Tuning

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, No. 8, pp. 6860-6873
Main Authors: Chen, Hong; Zhang, Yipeng; Wang, Xin; Duan, Xuguang; Zhou, Yuwei; Zhu, Wenwu
Format: Journal Article
Language: English
Published: IEEE, 01.08.2024
DOI: 10.1109/TCSVT.2024.3369757
ISSN: 1051-8215 (print); 1558-2205 (electronic)
CODEN: ITCTEM
Subjects: Circuits and systems; Controllability; Diffusion model; Disentangled finetuning; Image synthesis; Noise reduction; Subject-driven text-to-image generation; Training; Tuning; Visualization
Online Access: https://ieeexplore.ieee.org/document/10445245

Abstract: Subject-driven text-to-image generation aims to generate customized images of a given subject based on text descriptions, and has drawn increasing attention recently. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant, sample-specific information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to low subject identity fidelity and low text prompt fidelity. To tackle these problems, we propose DisenDreamer, a sample-aware disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenDreamer finetunes the pretrained diffusion model in the denoising process. Unlike previous works that denoise with a single entangled embedding, DisenDreamer uses a common text embedding to capture the identity-relevant information and a sample-specific visual embedding to capture the identity-irrelevant information. To disentangle the two embeddings, we further design three novel tuning objectives: weak common denoising, weak sample-aware denoising, and contrastive embedding auxiliary tuning. Extensive experiments show that the proposed DisenDreamer framework outperforms baseline models for subject-driven text-to-image generation. Additionally, by combining the identity-relevant and identity-irrelevant embeddings, DisenDreamer demonstrates greater generation flexibility and controllability.
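The abstract names the three tuning objectives but this record does not include their formulas. As a rough illustration of how such a disentangled tuning loss could be assembled on top of a standard epsilon-prediction diffusion loss, here is a minimal PyTorch-style sketch; every function name, weight, and loss form below is an assumption for illustration, not the paper's published formulation.

```python
# Hypothetical sketch of a sample-aware disentangled tuning step.
# Assumes a standard epsilon-prediction diffusion loss; the weighting
# scheme and the contrastive term are illustrative guesses, not
# DisenDreamer's actual objectives.
import torch
import torch.nn.functional as F

def disentangled_tuning_loss(
    unet,            # pretrained denoising UNet: (noisy_latent, t, cond) -> eps_hat
    noisy_latent,    # latents with noise added at timestep t
    noise,           # ground-truth noise epsilon
    t,               # diffusion timestep(s)
    common_emb,      # shared text embedding (identity-relevant), shape (B, L, D)
    sample_emb,      # per-image visual embedding (identity-irrelevant), shape (B, L, D)
    w_common=0.5,    # "weak" weight on the common-only denoising term
    w_sample=0.5,    # "weak" weight on the sample-aware denoising term
    w_contrast=0.1,  # weight on the contrastive disentanglement term
):
    # Weak common denoising: condition only on the identity-relevant
    # common embedding, so it must capture the subject identity alone.
    eps_common = unet(noisy_latent, t, common_emb)
    loss_common = F.mse_loss(eps_common, noise)

    # Weak sample-aware denoising: condition on both embeddings, so the
    # sample-specific embedding absorbs background/pose information.
    eps_full = unet(noisy_latent, t, torch.cat([common_emb, sample_emb], dim=1))
    loss_sample = F.mse_loss(eps_full, noise)

    # Contrastive embedding objective: penalize similarity between the
    # pooled embeddings so the identity-relevant and identity-irrelevant
    # information stays separated.
    sim = F.cosine_similarity(common_emb.mean(dim=1), sample_emb.mean(dim=1))
    loss_contrast = sim.clamp(min=0).mean()

    return w_common * loss_common + w_sample * loss_sample + w_contrast * loss_contrast
```

In this sketch the "weak" weights keep either denoising term from dominating, which matches the abstract's framing of the two denoising objectives as deliberately weakened; the actual weights and the contrastive formulation would have to be taken from the paper itself.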
Author Details (all: Department of Computer Science and Technology, Tsinghua University, Beijing, China):
1. Chen, Hong (h-chen20@mails.tsinghua.edu.cn; ORCID 0000-0002-0943-2286)
2. Zhang, Yipeng (zhang-yp22@mails.tsinghua.edu.cn; ORCID 0009-0002-0886-8296)
3. Wang, Xin (xin_wang@tsinghua.edu.cn; ORCID 0000-0002-0351-2939)
4. Duan, Xuguang (dxg18@mails.tsinghua.edu.cn; ORCID 0000-0001-9108-9618)
5. Zhou, Yuwei (zhou-yw21@mails.tsinghua.edu.cn; ORCID 0000-0001-9582-7331)
6. Zhu, Wenwu (wwzhu@tsinghua.edu.cn; ORCID 0000-0003-2236-9290)
Funding:
- Beijing Key Laboratory of Networked Multimedia
- National Natural Science Foundation of China (Grants 62222209, 62250008, 62102222)
- Beijing National Research Center for Information Science and Technology (Grants BNR2023RC01003, BNR2023TD03006)
- National Key Research and Development Program of China (Grant 2023YFF1205001)