Learned Representation-Guided Diffusion Models for Large-Image Generation


Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), Vol. 2024, pp. 8532-8542
Main Authors: Graikos, Alexandros; Yellapragada, Srikar; Le, Minh-Quan; Kapse, Saarthak; Prasanna, Prateek; Saltz, Joel; Samaras, Dimitris
Format: Conference Proceeding; Journal Article
Language: English
Published: United States, IEEE, 01.06.2024

More Information
Summary: To synthesize high-fidelity samples, diffusion models typically require auxiliary data to guide the generation process. However, it is impractical to procure the painstaking patch-level annotation effort required in specialized domains like histopathology and satellite imagery; it is often performed by domain experts and involves hundreds of millions of patches. Modern-day self-supervised learning (SSL) representations encode rich semantic and visual information. In this paper, we posit that such representations are expressive enough to act as proxies to fine-grained human labels. We introduce a novel approach that trains diffusion models conditioned on embeddings from SSL. Our diffusion models successfully project these features back to high-quality histopathology and remote sensing images. In addition, we construct larger images by assembling spatially consistent patches inferred from SSL embeddings, preserving long-range dependencies. Augmenting real data by generating variations of real images improves downstream classifier accuracy for patch-level and larger, image-scale classification tasks. Our models are effective even on datasets not encountered during training, demonstrating their robustness and generalizability. Generating images from learned embeddings is agnostic to the source of the embeddings. The SSL embeddings used to generate a large image can either be extracted from a reference image, or sampled from an auxiliary model conditioned on any related modality (e.g. class labels, text, genomic data). As proof of concept, we introduce the text-to-large image synthesis paradigm where we successfully synthesize large pathology and satellite images out of text descriptions. Code is available at this link.
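Illustrative sketch (not the authors' released code): the core idea described in the abstract, conditioning a denoising diffusion model on a frozen SSL embedding of each patch instead of a human label, can be expressed as a single DDPM-style training step. The Denoiser network, ssl_encoder, and argument names below are hypothetical placeholders, not the paper's actual architectures.

import torch
import torch.nn.functional as F

def training_step(denoiser, ssl_encoder, x0, alphas_cumprod, optimizer):
    # x0: batch of clean image patches, shape (B, C, H, W)
    # alphas_cumprod: 1-D tensor of cumulative noise-schedule products
    with torch.no_grad():
        cond = ssl_encoder(x0)          # frozen SSL embedding acts as the label proxy

    # Sample a random timestep per example and add the corresponding noise
    t = torch.randint(0, len(alphas_cumprod), (x0.shape[0],), device=x0.device)
    a_bar = alphas_cumprod[t].view(-1, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = a_bar.sqrt() * x0 + (1 - a_bar).sqrt() * noise

    # Denoiser predicts the injected noise, conditioned on the SSL embedding
    pred = denoiser(x_t, t, cond)
    loss = F.mse_loss(pred, noise)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

At sampling time, the same conditioning slot can be fed embeddings taken from a reference image or produced by an auxiliary model (e.g. conditioned on text), which is how the abstract's large-image and text-to-large-image settings reuse one trained denoiser.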
Note: Equal contribution.
ISSN: 1063-6919
DOI: 10.1109/CVPR52733.2024.00815