Learned Representation-Guided Diffusion Models for Large-Image Generation


Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), Vol. 2024, pp. 8532-8542
Main Authors: Graikos, Alexandros; Yellapragada, Srikar; Le, Minh-Quan; Kapse, Saarthak; Prasanna, Prateek; Saltz, Joel; Samaras, Dimitris
Format: Conference Proceeding; Journal Article
Language: English
Published: United States, IEEE, 01.06.2024

More Information
Summary: To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions. Code is available at this link.
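Illustrative sketch (not the authors' released code): the core idea described in the abstract, conditioning a denoising diffusion model on a frozen SSL embedding of each patch instead of a human label, can be expressed as a single DDPM-style training step. The Denoiser network, ssl_encoder, and argument names below are hypothetical placeholders, not the paper's actual architectures.

import torch
import torch.nn.functional as F

def training_step(denoiser, ssl_encoder, x0, alphas_cumprod, optimizer):
    # x0: batch of clean image patches, shape (B, C, H, W)
    # alphas_cumprod: 1-D tensor of cumulative noise-schedule products
    with torch.no_grad():
        cond = ssl_encoder(x0)          # frozen SSL embedding acts as the label proxy

    # Sample a random timestep per example and add the corresponding noise
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

    # Denoiser predicts the injected noise, conditioned on the SSL embedding
    pred = denoiser(x_t, t, cond)
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At sampling time, the same conditioning slot can be fed embeddings taken from a reference image or produced by an auxiliary model (e.g. conditioned on text), which is how the abstract's large-image and text-to-large-image settings reuse one trained denoiser.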
Note: Equal contribution.
ISSN: 1063-6919
DOI: 10.1109/CVPR52733.2024.00815