Training-free subject-enhanced attention guidance for compositional text-to-image generation

Bibliographic Details
Published in: Pattern Recognition, Vol. 170, p. 112111
Main Authors: Liu, Shengyuan; Wang, Bo; Ma, Ye; Yang, Te; Chen, Quan; Dong, Di
Format: Journal Article
Language: English
Published: Elsevier Ltd, 01.02.2026
Subjects: Compositional generation; Diffusion model; Subject-driven generation

Abstract
• Propose a zero-shot diffusion-based framework for the subject-driven generation task.
• Introduce a training-free subject-enhanced attention guidance.
• Propose a novel evaluation metric, GroundingScore, for comprehensive assessment.
Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. When generating compositional subjects, they often encounter problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance that intervenes in the generative process at inference time. This approach strengthens the attention map, allowing precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel GroundingScore metric to thoroughly assess subject alignment. The quantitative results provide compelling evidence of the effectiveness of the proposed method.
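The record describes the subject-enhanced attention guidance only at a high level, so the snippet below is a minimal, purely illustrative sketch rather than the authors' implementation. It assumes the cross-attention logits, the subject token indices, per-subject region masks, and a boost strength are all given; the names enhance_subject_attention, subject_tokens, region_masks, and boost are hypothetical. In the actual method the guidance is applied inside the diffusion model's cross-attention layers at inference time, which this toy example does not model.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def enhance_subject_attention(attn_logits, subject_tokens, region_masks, boost=2.0):
    """Toy, hypothetical sketch of training-free attention guidance.

    attn_logits    : (P, T) cross-attention logits for P image patches and T text tokens.
    subject_tokens : one token index per subject in the prompt.
    region_masks   : one boolean (P,) mask per subject, marking where that subject should appear.
    boost          : strength of the additive guidance (illustrative parameter).
    """
    guided = attn_logits.copy()
    for tok, mask in zip(subject_tokens, region_masks):
        guided[mask, tok] += boost    # strengthen the subject token inside its own region
        guided[~mask, tok] -= boost   # discourage it from leaking into other subjects' regions
    return softmax(guided, axis=-1)   # renormalize so each patch's attention over tokens sums to 1

# Toy usage: 4 image patches, 6 prompt tokens, two subjects at token positions 2 and 4.
rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 6))
masks = [np.array([True, True, False, False]), np.array([False, False, True, True])]
attention = enhance_subject_attention(logits, subject_tokens=[2, 4], region_masks=masks)
print(attention.round(3))
```

The point the abstract emphasizes, stronger attention on each subject so that attributes bind to the correct object, corresponds here to boosting a subject token only within its own region before renormalizing; a full implementation would repeat this at every denoising step and across attention heads.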
ArticleNumber 112111
Authors
– Liu, Shengyuan (ORCID 0000-0003-2317-3783), liushengyuan2021@ia.ac.cn; The Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong 999077, Hong Kong SAR, China
– Wang, Bo (ORCID 0000-0001-8848-3497), wangbo0060@163.com; Kuaishou Technology, Beijing 100000, China
– Ma, Ye, maye@kuaishou.com; Kuaishou Technology, Beijing 100000, China
– Yang, Te, yangte2021@ia.ac.cn; Institute of Automation, Chinese Academy of Sciences, Beijing 100000, China
– Chen, Quan (ORCID 0000-0002-4865-2396), myctllmail@163.com; Kuaishou Technology, Beijing 100000, China
– Dong, Di, di.dong@ia.ac.cn; Institute of Automation, Chinese Academy of Sciences, Beijing 100000, China
Copyright 2025 Elsevier Ltd
DOI 10.1016/j.patcog.2025.112111
Discipline Computer Science
ISSN 0031-3203
IsPeerReviewed true
IsScholarly true
URI https://dx.doi.org/10.1016/j.patcog.2025.112111