High-fidelity Person-centric Subject-to-Image Synthesis

Bibliographic Details
Published in Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 7675-7684
Main Authors Wang, Yibin, Zhang, Weizhong, Zheng, Jianwei, Jin, Cheng
Format Conference Proceeding
Language English
Published IEEE 16.06.2024
Subjects Collaboration, Diffusion, Image synthesis, Noise, Pipelines, Predictive models, Semantics, Subject-to-Image Synthesis, Text-to-Image Synthesis, Training
Online Access https://ieeexplore.ieee.org/document/10655694

Abstract Current subject-driven image generation methods encounter significant challenges in person-centric image generation. The reason is that they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Precisely, to generate realistic persons, they need to sufficiently tune the pre-trained model, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation overfit to the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline that eliminates the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., a Text-driven Diffusion Model (TDM) and a Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages, i.e., semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. The subject-scene fusion stage is the collaboration achieved through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). Specifically, it is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner, all of which can be seamlessly integrated into the DDIM sampling process. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser in generating high-fidelity person images depicting multiple unseen persons in varying contexts. Code is available at https://github.com/CodeGoat24/Face-diffuser.
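The fusion step described in the abstract can be sketched compactly. The following is a minimal, hypothetical illustration, not the authors' released implementation: `tdm` and `sdm` stand in for the two noise-prediction networks, each model's classifier-free guidance response serves as its saliency proxy, and the model with the stronger response wins each spatial location before a standard deterministic DDIM update.

```python
import torch


def snf_ddim_step(x_t, t, tdm, sdm, scene_cond, subj_cond, alpha_t, alpha_prev, g=7.5):
    """One deterministic DDIM step with saliency-adaptive fusion of two noise predictions.

    Hypothetical interfaces: `tdm` and `sdm` are noise-prediction networks called as
    model(x_t, t, cond); cond=None selects the unconditional branch used for
    classifier-free guidance. alpha_t / alpha_prev are cumulative alphas.
    """
    # Classifier-free guidance responses of the scene (TDM) and subject (SDM) models.
    eps_tdm_c, eps_tdm_u = tdm(x_t, t, scene_cond), tdm(x_t, t, None)
    eps_sdm_c, eps_sdm_u = sdm(x_t, t, subj_cond), sdm(x_t, t, None)
    r_tdm = eps_tdm_c - eps_tdm_u
    r_sdm = eps_sdm_c - eps_sdm_u

    # Saliency proxy: per-pixel magnitude of each model's guidance response.
    s_tdm = r_tdm.abs().mean(dim=1, keepdim=True)
    s_sdm = r_sdm.abs().mean(dim=1, keepdim=True)
    mask = (s_sdm > s_tdm).float()  # 1 where the subject model should dominate

    # Guided noise of each model, then spatial blending under the saliency mask.
    eps_tdm = eps_tdm_u + g * r_tdm
    eps_sdm = eps_sdm_u + g * r_sdm
    eps = mask * eps_sdm + (1.0 - mask) * eps_tdm

    # Standard DDIM (eta = 0) update with the fused noise.
    x0_pred = (x_t - (1 - alpha_t) ** 0.5 * eps) / (alpha_t ** 0.5)
    x_prev = (alpha_prev ** 0.5) * x0_pred + (1 - alpha_prev) ** 0.5 * eps
    return x_prev
```

The actual SNF may derive and post-process the saliency maps differently; the hard per-pixel mask above is only the simplest stand-in for the saliency-aware blending the abstract describes.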
Authors
– Wang, Yibin (yibinwang1121@163.com), School of Computer Science, Fudan University
– Zhang, Weizhong (weizhongzhang@fudan.edu.cn), School of Data Science, Fudan University
– Zheng, Jianwei (zjw@zjut.edu.cn), College of Computer Science and Technology, Zhejiang University of Technology
– Jin, Cheng (jc@fudan.edu.cn), School of Computer Science, Fudan University
CODEN IEEPAD
ContentType Conference Proceeding
DOI 10.1109/CVPR52733.2024.00733
Discipline Applied Sciences
EISBN 9798350353006
EISSN 1063-6919
EndPage 7684
ExternalDocumentID 10655694
Genre orig-research
IsPeerReviewed false
IsScholarly true
Language English
PageCount 10
PublicationCentury 2000
PublicationDate 2024-June-16
PublicationDateYYYYMMDD 2024-06-16
PublicationDecade 2020
PublicationTitle Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online)
PublicationTitleAbbrev CVPR
PublicationYear 2024
Publisher IEEE
StartPage 7675
SubjectTerms Collaboration
Diffusion
Image synthesis
Noise
Pipelines
Predictive models
Semantics
Subject-to-Image Synthesis
Text-to-Image Synthesis
Training
Title High-fidelity Person-centric Subject-to-Image Synthesis
URI https://ieeexplore.ieee.org/document/10655694