Instruct-Imagen: Image Generation with Multi-modal Instruction

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 4754 - 4763
Main Authors: Hu, Hexiang; Chan, Kelvin C.K.; Su, Yu-Chuan; Chen, Wenhu; Li, Yandong; Sohn, Kihyuk; Zhao, Yang; Ben, Xue; Gong, Boqing; Cohen, William; Chang, Ming-Wei; Jia, Xuhui
Format: Conference Proceeding
Language: English
Published: IEEE, 16.06.2024
ISSN: 1063-6919
DOI: 10.1109/CVPR52733.2024.00455

More Information
Summary: This paper presents Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), so that abundant generation intents can be standardized in a uniform format. We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model in two stages. First, we adapt the model using retrieval-augmented training to enhance its ability to ground generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Our evaluation suite will be made publicly available.
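The multi-modal instruction described in the summary pairs a natural-language instruction with tagged multi-modal context (text, edge maps, style or subject images, etc.). The following is a minimal sketch of how such an instruction record might be represented; the class, field names, and the bracketed reference syntax are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of a multi-modal instruction record. The reference
# syntax "[context_N]" and all field names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MultiModalInstruction:
    # Natural-language instruction that refers to attached contexts by tag,
    # e.g. "Render the subject in [context_1] in the style of [context_2]."
    text: str
    # Tag -> modality payload; file paths stand in for images, edge maps,
    # masks, style references, etc.
    contexts: Dict[str, str] = field(default_factory=dict)


# Example: a subject-driven, style-conditioned generation intent expressed
# uniformly as natural language plus tagged multi-modal context.
instruction = MultiModalInstruction(
    text="Generate an image of the dog in [context_1], painted in the style of [context_2].",
    contexts={"context_1": "dog_photo.png", "context_2": "style_reference.png"},
)
```

Standardizing heterogeneous tasks (text-to-image, edge-conditioned, subject-driven, style transfer) into one such format is what lets a single model be fine-tuned across them and, per the summary, generalize to unseen combinations.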