DisenDreamer: Subject-Driven Text-to-Image Generation With Sample-Aware Disentangled Tuning

Bibliographic Details
Published in: IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, No. 8, pp. 6860-6873
Main Authors: Chen, Hong; Zhang, Yipeng; Wang, Xin; Duan, Xuguang; Zhou, Yuwei; Zhu, Wenwu
Format: Journal Article
Language: English
Published: IEEE, 01.08.2024
DOI: 10.1109/TCSVT.2024.3369757
ISSN: 1051-8215 (print); 1558-2205 (electronic)
CODEN: ITCTEM
Subjects: Circuits and systems; Controllability; Diffusion model; Disentangled finetuning; Image synthesis; Noise reduction; Subject-driven text-to-image generation; Training; Tuning; Visualization
Online Access: https://ieeexplore.ieee.org/document/10445245

Abstract: Subject-driven text-to-image generation aims to generate customized images of a given subject based on text descriptions, and has drawn increasing attention recently. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant, sample-specific information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to low subject identity fidelity and low text prompt fidelity. To tackle these problems, we propose DisenDreamer, a sample-aware disentangled tuning framework for subject-driven text-to-image generation. Specifically, DisenDreamer finetunes the pretrained diffusion model in the denoising process. Unlike previous works that denoise with a single entangled embedding, DisenDreamer uses a common text embedding to capture the identity-relevant information and a sample-specific visual embedding to capture the identity-irrelevant information. To disentangle the two embeddings, we further design three novel tuning objectives: weak common denoising, weak sample-aware denoising, and contrastive embedding auxiliary tuning. Extensive experiments show that the proposed DisenDreamer framework outperforms baseline models for subject-driven text-to-image generation. Additionally, by combining the identity-relevant and identity-irrelevant embeddings, DisenDreamer demonstrates greater generation flexibility and controllability.
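The abstract names the three tuning objectives but this record does not include their formulas. As a rough illustration of how such a disentangled tuning loss could be assembled on top of a standard epsilon-prediction diffusion loss, here is a minimal PyTorch-style sketch; every function name, weight, and loss form below is an assumption for illustration, not the paper's published formulation.

```python
# Hypothetical sketch of a sample-aware disentangled tuning step.
# Assumes a standard epsilon-prediction diffusion loss; the weighting
# scheme and the contrastive term are illustrative guesses, not
# DisenDreamer's actual objectives.
import torch
import torch.nn.functional as F

def disentangled_tuning_loss(
    unet,            # pretrained denoising UNet: (noisy_latent, t, cond) -> eps_hat
    noisy_latent,    # latents with noise added at timestep t
    noise,           # ground-truth noise epsilon
    t,               # diffusion timestep(s)
    common_emb,      # shared text embedding (identity-relevant), shape (B, L, D)
    sample_emb,      # per-image visual embedding (identity-irrelevant), shape (B, L, D)
    w_common=0.5,    # "weak" weight on the common-only denoising term
    w_sample=0.5,    # "weak" weight on the sample-aware denoising term
    w_contrast=0.1,  # weight on the contrastive disentanglement term
):
    # Weak common denoising: condition only on the identity-relevant
    # common embedding, so it must capture the subject identity alone.
    eps_common = unet(noisy_latent, t, common_emb)
    loss_common = F.mse_loss(eps_common, noise)

    # Weak sample-aware denoising: condition on both embeddings, so the
    # sample-specific embedding absorbs background/pose information.
    eps_full = unet(noisy_latent, t, torch.cat([common_emb, sample_emb], dim=1))
    loss_sample = F.mse_loss(eps_full, noise)

    # Contrastive embedding objective: penalize similarity between the
    # pooled embeddings so the identity-relevant and identity-irrelevant
    # information stays separated.
    sim = F.cosine_similarity(common_emb.mean(dim=1), sample_emb.mean(dim=1))
    loss_contrast = sim.clamp(min=0).mean()

    return w_common * loss_common + w_sample * loss_sample + w_contrast * loss_contrast
```

In this sketch the "weak" weights keep either denoising term from dominating, which matches the abstract's framing of the two denoising objectives as deliberately weakened; the actual weights and the contrastive formulation would have to be taken from the paper itself.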
Author Details (all: Department of Computer Science and Technology, Tsinghua University, Beijing, China):
1. Chen, Hong (h-chen20@mails.tsinghua.edu.cn; ORCID 0000-0002-0943-2286)
2. Zhang, Yipeng (zhang-yp22@mails.tsinghua.edu.cn; ORCID 0009-0002-0886-8296)
3. Wang, Xin (xin_wang@tsinghua.edu.cn; ORCID 0000-0002-0351-2939)
4. Duan, Xuguang (dxg18@mails.tsinghua.edu.cn; ORCID 0000-0001-9108-9618)
5. Zhou, Yuwei (zhou-yw21@mails.tsinghua.edu.cn; ORCID 0000-0001-9582-7331)
6. Zhu, Wenwu (wwzhu@tsinghua.edu.cn; ORCID 0000-0003-2236-9290)
Funding:
- Beijing Key Laboratory of Networked Multimedia
- National Natural Science Foundation of China (Grants 62222209, 62250008, 62102222)
- Beijing National Research Center for Information Science and Technology (Grants BNR2023RC01003, BNR2023TD03006)
- National Key Research and Development Program of China (Grant 2023YFF1205001)