Beyond Inserting: Learning Subject Embedding for Semantic-Fidelity Personalized Diffusion Generation
| Published in | IEEE Transactions on Circuits and Systems for Video Technology, p. 1 |
|---|---|
| Format | Journal Article |
| Language | English |
| Published | IEEE, 2025 |
Summary: Text-to-Image (T2I) personalization based on advanced diffusion models (e.g., Stable Diffusion), which aims to generate images of target subjects given various prompts, has drawn huge attention. However, when users require personalized image generation for specific subjects, such as themselves or their pet cat, T2I models fail to accurately generate subject-preserving images. The main problem is that pre-trained T2I models have not learned the T2I mapping between the target subjects and their corresponding visual content. Even when multiple images of the target subject are provided, previous personalization methods either fail to accurately fit the subject region or lose the ability to interact with other existing concepts in the T2I model space. For example, they cannot generate T2I-aligned, semantic-fidelity images for prompts that combine the subject with other concepts such as scenes ("Eiffel Tower"), actions ("holding a basketball"), or facial attributes ("eyes closed"). In this paper, we focus on inserting an accurate and interactive subject embedding into the Stable Diffusion model for semantic-fidelity personalized generation from a single image. We address this challenge from two perspectives: a subject-wise attention loss and semantic-fidelity token optimization. Specifically, we propose a subject-wise attention loss that guides the subject embedding onto a manifold with high subject identity similarity and diverse interactive generative ability. We then optimize one subject representation as multiple per-stage tokens, each containing two disentangled features. This expansion of the textual conditioning space enhances semantic control and thereby improves semantic fidelity. We conduct extensive experiments on the most challenging subjects, face identities, and validate that our results exhibit superior subject accuracy and fine-grained manipulation ability. We further validate the generalization of our method on various non-face subjects.
ISSN: 1051-8215, 1558-2205
DOI: 10.1109/TCSVT.2025.3588882
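
The summary above names two mechanisms, a subject-wise attention loss and per-stage subject tokens with two disentangled features each, without giving formulas. The sketch below is a minimal, hypothetical PyTorch illustration of how such components are often realized; the names (`PerStageSubjectTokens`, `subject_wise_attention_loss`, `num_stages`, `embed_dim`) and the exact loss form are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class PerStageSubjectTokens(nn.Module):
    """Hypothetical per-stage subject tokens: one learned token per stage,
    each built from two separately learned feature halves (e.g., one
    identity-related, one attribute-related), concatenated to the text
    encoder's embedding width."""
    def __init__(self, num_stages: int = 4, embed_dim: int = 768):
        super().__init__()
        half = embed_dim // 2
        self.feat_a = nn.Parameter(torch.randn(num_stages, half) * 0.02)
        self.feat_b = nn.Parameter(torch.randn(num_stages, half) * 0.02)

    def forward(self, stage_idx: int) -> torch.Tensor:
        # Embedding that would replace the subject placeholder token
        # for the given UNet stage (or group of denoising steps).
        return torch.cat([self.feat_a[stage_idx], self.feat_b[stage_idx]], dim=-1)


def subject_wise_attention_loss(cross_attn: torch.Tensor,
                                subject_mask: torch.Tensor,
                                subject_token_idx: int) -> torch.Tensor:
    """One plausible form of a subject-wise attention loss: concentrate the
    subject token's cross-attention inside the subject region and penalize
    attention mass that leaks outside it.

    cross_attn:   (B, H*W, T) cross-attention maps from one UNet layer.
    subject_mask: (B, H*W) binary subject mask at the attention resolution.
    """
    attn = cross_attn[:, :, subject_token_idx]            # (B, H*W)
    attn = attn / (attn.sum(dim=1, keepdim=True) + 1e-8)  # normalize per image
    mask = subject_mask.float()
    outside = (attn * (1.0 - mask)).sum(dim=1)            # leaked attention
    inside = (attn * mask).sum(dim=1)                      # attention on subject
    return (outside - inside).mean()


# Usage sketch (shapes only): a 16x16 attention map over 77 prompt tokens.
tokens = PerStageSubjectTokens(num_stages=4, embed_dim=768)
subject_emb = tokens(stage_idx=0)                          # embedding for stage 0
attn = torch.rand(2, 16 * 16, 77).softmax(dim=-1)
mask = torch.rand(2, 16 * 16) > 0.5
loss = subject_wise_attention_loss(attn, mask, subject_token_idx=5)
```

The split into `feat_a` and `feat_b` mirrors the summary's "two disentangled features" per token, and keeping separate parameters per stage mirrors the per-stage token expansion; how the stages map onto UNet layers or timesteps is not specified in the abstract and is left open here.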