Instruct-Imagen: Image Generation with Multi-modal Instruction

Bibliographic Details
Published in: Proceedings (IEEE Computer Society Conference on Computer Vision and Pattern Recognition. Online), pp. 4754 - 4763
Main Authors: Hu, Hexiang; Chan, Kelvin C.K.; Su, Yu-Chuan; Chen, Wenhu; Li, Yandong; Sohn, Kihyuk; Zhao, Yang; Ben, Xue; Gong, Boqing; Cohen, William; Chang, Ming-Wei; Jia, Xuhui
Format: Conference Proceeding
Language: English
Published: IEEE, 16.06.2024
ISSN: 1063-6919
DOI: 10.1109/CVPR52733.2024.00455

More Information
Summary: This paper presents Instruct-Imagen, a model that tackles heterogeneous image generation tasks and generalizes across unseen tasks. We introduce multi-modal instruction for image generation, a task representation articulating a range of generation intents with precision. It uses natural language to amalgamate disparate modalities (e.g., text, edge, style, subject, etc.), so that abundant generation intents can be standardized in a uniform format. We then build Instruct-Imagen by fine-tuning a pre-trained text-to-image diffusion model in two stages. First, we adapt the model using retrieval-augmented training to enhance its ability to ground generation on external multi-modal context. Subsequently, we fine-tune the adapted model on diverse image generation tasks that require vision-language understanding (e.g., subject-driven generation), each paired with a multi-modal instruction encapsulating the task's essence. Human evaluation on various image generation datasets reveals that Instruct-Imagen matches or surpasses prior task-specific models in-domain and demonstrates promising generalization to unseen and more complex tasks. Our evaluation suite will be made publicly available.
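The multi-modal instruction described in the summary pairs a natural-language instruction with tagged multi-modal context (text, edge maps, style or subject images, etc.). The following is a minimal sketch of how such an instruction record might be represented; the class, field names, and the bracketed reference syntax are illustrative assumptions, not the paper's actual data format.

```python
# Hypothetical sketch of a multi-modal instruction record. The reference
# syntax "[context_N]" and all field names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class MultiModalInstruction:
    # Natural-language instruction that refers to attached contexts by tag,
    # e.g. "Render the subject in [context_1] in the style of [context_2]."
    text: str
    # Tag -> modality payload; file paths stand in for images, edge maps,
    # masks, style references, etc.
    contexts: Dict[str, str] = field(default_factory=dict)


# Example: a subject-driven, style-conditioned generation intent expressed
# uniformly as natural language plus tagged multi-modal context.
instruction = MultiModalInstruction(
    text="Generate an image of the dog in [context_1], painted in the style of [context_2].",
    contexts={"context_1": "dog_photo.png", "context_2": "style_reference.png"},
)
```

Standardizing heterogeneous tasks (text-to-image, edge-conditioned, subject-driven, style transfer) into one such format is what lets a single model be fine-tuned across them and, per the summary, generalize to unseen combinations.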