DisenDreamer: Subject-Driven Text-to-Image Generation With Sample-Aware Disentangled Tuning
Published in | IEEE Transactions on Circuits and Systems for Video Technology, Vol. 34, No. 8, pp. 6860-6873 |
---|---|
Main Authors | Hong Chen, Yipeng Zhang, Xin Wang, Xuguang Duan, Yuwei Zhou, Wenwu Zhu |
Format | Journal Article |
Language | English |
Published | IEEE, 01.08.2024 |
Abstract | Subject-driven text-to-image generation aims to generate customized images of the given subject based on the text descriptions, which has drawn increasing attention recently. Existing methods mainly resort to finetuning a pretrained generative model, where the identity-relevant information (e.g., the boy) and the identity-irrelevant sample-specific information (e.g., the background or the pose of the boy) are entangled in the latent embedding space. However, the highly entangled latent embedding may lead to low subject identity fidelity and text prompt fidelity. To tackle the problems, we propose DisenDreamer, a sample-aware disentangled tuning framework for subject-driven text-to-image generation in this paper. Specifically, DisenDreamer finetunes the pretrained diffusion model in the denoising process. Different from previous works that utilize an entangled embedding to denoise, DisenDreamer instead utilizes a common text embedding to capture the identity-relevant information and a sample-specific visual embedding to capture the identity-irrelevant information. To disentangle the two embeddings, we further design the novel weak common denoising, weak sample-aware denoising, and the contrastive embedding auxiliary tuning objectives. Extensive experiments show that our proposed DisenDreamer framework outperforms baseline models for subject-driven text-to-image generation. Additionally, by combining the identity-relevant and the identity-irrelevant embedding, DisenDreamer demonstrates more generation flexibility and controllability. |
---|---|
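The tuning scheme summarized in the abstract (a shared identity-relevant embedding, a per-sample identity-irrelevant embedding, weak common denoising, weak sample-aware denoising, and a contrastive embedding objective) can be sketched as follows. This is an illustrative sketch only, not the authors' implementation: the function names, loss weights, and the cosine-similarity margin are all hypothetical stand-ins, since the record does not reproduce the paper's actual formulation.

```python
import numpy as np

def denoise_loss(pred_noise, true_noise):
    # Standard diffusion-style MSE between predicted and ground-truth noise.
    return float(np.mean((pred_noise - true_noise) ** 2))

def contrastive_term(common, sample_embeds, margin=0.5):
    # Hypothetical contrastive objective: penalize cosine similarity between
    # the shared (identity-relevant) embedding and each sample-specific
    # (identity-irrelevant) embedding above a margin, pushing them apart.
    loss = 0.0
    for s in sample_embeds:
        cos = np.dot(common, s) / (
            np.linalg.norm(common) * np.linalg.norm(s) + 1e-8)
        loss += max(0.0, cos - margin)
    return loss / len(sample_embeds)

def disentangled_objective(pred_common, pred_sample, true_noise,
                           common, sample_embeds,
                           w_common=0.1, w_sample=0.1, w_con=1.0):
    # "Weak" denoising branches: small weights (hypothetical values) so that
    # neither embedding alone must explain the image, while the contrastive
    # term drives the two embeddings to encode disjoint information.
    return (w_common * denoise_loss(pred_common, true_noise)
            + w_sample * denoise_loss(pred_sample, true_noise)
            + w_con * contrastive_term(common, sample_embeds))
```

The design intent, as the abstract describes it, is that at generation time the common embedding carries the subject identity while the sample-specific embedding can be swapped or dropped, giving the extra flexibility and controllability the paper reports.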
Author | Chen, Hong; Zhang, Yipeng; Wang, Xin; Duan, Xuguang; Zhou, Yuwei; Zhu, Wenwu (all: Department of Computer Science and Technology, Tsinghua University, Beijing, China) |
CODEN | ITCTEM |
DOI | 10.1109/TCSVT.2024.3369757 |
Discipline | Engineering |
EISSN | 1558-2205 |
EndPage | 6873 |
Genre | orig-research |
Funding | National Natural Science Foundation of China (62222209, 62250008, 62102222); National Key Research and Development Program of China (2023YFF1205001); Beijing National Research Center for Information Science and Technology (BNR2023RC01003, BNR2023TD03006); Beijing Key Laboratory of Networked Multimedia |
ISSN | 1051-8215 |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 8 |
License | https://ieeexplore.ieee.org/Xplorehelp/downloads/license-information/IEEE.html https://doi.org/10.15223/policy-029 https://doi.org/10.15223/policy-037 |
ORCID | 0000-0001-9108-9618 0000-0002-0943-2286 0000-0002-0351-2939 0000-0001-9582-7331 0000-0003-2236-9290 0009-0002-0886-8296 |
PageCount | 14 |
PublicationDate | 2024-08-01 |
PublicationTitle | IEEE transactions on circuits and systems for video technology |
PublicationTitleAbbrev | TCSVT |
PublicationYear | 2024 |
Publisher | IEEE |
StartPage | 6860 |
SubjectTerms | Circuits and systems Controllability Diffusion model disentangled finetuning Image synthesis Noise reduction subject-driven text-to-image generation Training Tuning Visualization |
URI | https://ieeexplore.ieee.org/document/10445245 |
Volume | 34 |