Training-free subject-enhanced attention guidance for compositional text-to-image generation
Published in | Pattern Recognition Vol. 170; p. 112111 |
Main Authors | Liu, Shengyuan; Wang, Bo; Ma, Ye; Yang, Te; Chen, Quan; Dong, Di |
Format | Journal Article |
Language | English |
Published | Elsevier Ltd, 01.02.2026 |
Subjects | Compositional generation; Diffusion model; Subject-driven generation |
Abstract | • Propose a zero-shot diffusion-based framework for the subject-driven generation task. • Introduce a training-free subject-enhanced attention guidance. • Propose a novel evaluation metric, GroundingScore, for comprehensive assessment.
Existing subject-driven text-to-image generation models require tedious fine-tuning and struggle to maintain both text-image alignment and subject fidelity. When generating compositional subjects, they often suffer from object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance that intervenes in the generative process at inference time. This approach strengthens the attention map, enabling precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially on the challenging task of compositional generation. Furthermore, we propose a novel GroundingScore metric to thoroughly assess subject alignment. Quantitative results demonstrate the effectiveness of the proposed method. |
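The abstract's core mechanism, strengthening the cross-attention that subject tokens receive so their attributes bind correctly, can be sketched in miniature. This is an illustrative sketch only, not the authors' implementation; the function names, the log-scale boost, and the toy dimensions are all assumptions:

```python
# Illustrative sketch of training-free subject-enhanced attention guidance:
# boost the attention logits of subject tokens before the softmax, so each
# subject's features are injected more strongly at inference time.
# All names and the boosting scheme here are hypothetical.
import math


def softmax(xs):
    """Numerically stable softmax over a list of logits."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def enhance_subject_attention(logits, subject_idx, boost=2.0):
    """Add log(boost) to the logits of subject tokens, which multiplies
    their pre-normalization attention weight by `boost`."""
    boosted = [
        x + math.log(boost) if i in subject_idx else x
        for i, x in enumerate(logits)
    ]
    return softmax(boosted)


# One query position attending over a 4-token prompt; tokens 1 and 3
# stand in for two compositional subjects that risk being dropped.
attn = enhance_subject_attention([0.1, 0.2, 0.3, 0.2], {1, 3})
```

After the boost, the two subject tokens hold a larger share of the (still normalized) attention mass, at the expense of the non-subject tokens, which is the intuition behind avoiding object missing.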
ArticleNumber | 112111 |
Author | Liu, Shengyuan; Wang, Bo; Ma, Ye; Yang, Te; Chen, Quan; Dong, Di |
Author_xml | – sequence: 1 givenname: Shengyuan orcidid: 0000-0003-2317-3783 surname: Liu fullname: Liu, Shengyuan email: liushengyuan2021@ia.ac.cn organization: The Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong, 999077, Hong Kong SAR, China – sequence: 2 givenname: Bo orcidid: 0000-0001-8848-3497 surname: Wang fullname: Wang, Bo email: wangbo0060@163.com organization: Kuaishou Technology, Beijing, 100000, China – sequence: 3 givenname: Ye surname: Ma fullname: Ma, Ye email: maye@kuaishou.com organization: Kuaishou Technology, Beijing, 100000, China – sequence: 4 givenname: Te surname: Yang fullname: Yang, Te email: yangte2021@ia.ac.cn organization: Institute of Automation, Chinese Academy of Sciences, Beijing, 100000, China – sequence: 5 givenname: Quan orcidid: 0000-0002-4865-2396 surname: Chen fullname: Chen, Quan email: myctllmail@163.com organization: Kuaishou Technology, Beijing, 100000, China – sequence: 6 givenname: Di surname: Dong fullname: Dong, Di email: di.dong@ia.ac.cn organization: Institute of Automation, Chinese Academy of Sciences, Beijing, 100000, China |
ContentType | Journal Article |
Copyright | 2025 Elsevier Ltd |
DOI | 10.1016/j.patcog.2025.112111 |
DatabaseName | CrossRef |
DatabaseTitle | CrossRef |
Discipline | Computer Science |
ExternalDocumentID | 10_1016_j_patcog_2025_112111 S003132032500771X |
ISSN | 0031-3203 |
IsPeerReviewed | true |
IsScholarly | true |
Keywords | Compositional generation; Subject-driven generation; Diffusion model |
Language | English |
ORCID | 0000-0001-8848-3497 0000-0003-2317-3783 0000-0002-4865-2396 |
PublicationDate | February 2026 |
PublicationTitle | Pattern recognition |
PublicationYear | 2026 |
Publisher | Elsevier Ltd |
Publisher_xml | – name: Elsevier Ltd |
SourceID | crossref elsevier |
SourceType | Index Database Publisher |
StartPage | 112111 |
SubjectTerms | Compositional generation; Diffusion model; Subject-driven generation |
Title | Training-free subject-enhanced attention guidance for compositional text-to-image generation |
URI | https://dx.doi.org/10.1016/j.patcog.2025.112111 |
Volume | 170 |