Text2LIVE: Text-Driven Layered Image and Video Editing

Bibliographic Details
Published in: Computer Vision - ECCV 2022, Vol. 13675, pp. 707-723
Main Authors: Bar-Tal, Omer; Ofri-Amar, Dolev; Fridman, Rafail; Kasten, Yoni; Dekel, Tali
Format: Book Chapter
Language: English
Published: Springer Nature Switzerland, 2022
Series: Lecture Notes in Computer Science
More Information
Summary: We present a method for zero-shot, text-driven editing of natural images and videos. Given an image or a video and a text prompt, our goal is to edit the appearance of existing objects (e.g., texture) or augment the scene with visual effects (e.g., smoke, fire) in a semantic manner. We train a generator on an internal dataset, extracted from a single input, while leveraging an external pretrained CLIP model to impose our losses. Rather than directly generating the edited output, our key idea is to generate an edit layer (color+opacity) that is composited over the input. This allows us to control the generation and maintain high fidelity to the input via novel text-driven losses applied directly to the edit layer. Our method neither relies on a pretrained generator nor requires user-provided masks. We demonstrate localized, semantic edits on high-resolution images and videos across a variety of objects and scenes. Webpage: http://www.text2live.github.io.
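The summary describes the core mechanism: a generator, trained on a single input, predicts an edit layer (color + opacity) that is alpha-composited over the original image, while a pretrained CLIP model supplies the text-driven losses. The sketch below is a minimal illustration of that compositing step and a CLIP-style text loss, not the authors' released implementation; it assumes PyTorch and OpenAI's CLIP package, and the `generator` referenced in the usage comments is a hypothetical stand-in for the paper's edit-layer generator.

```python
# Minimal sketch (not the authors' code) of the layered-editing idea described above:
# a generator predicts an RGBA "edit layer" that is alpha-composited over the input,
# and a CLIP-based loss scores the composite against the text prompt.
# Assumes PyTorch and OpenAI's CLIP package (pip install git+https://github.com/openai/CLIP.git).

import torch
import torch.nn.functional as F
import clip


def composite(edit_rgb, edit_alpha, input_rgb):
    """Alpha-composite the predicted edit layer over the input image."""
    return edit_alpha * edit_rgb + (1.0 - edit_alpha) * input_rgb


def clip_text_loss(clip_model, image, prompt, device="cuda"):
    """Negative cosine similarity between the composite and the target text."""
    # CLIP expects 224x224 inputs normalized with its own statistics
    # (normalization omitted here for brevity).
    image_224 = F.interpolate(image, size=(224, 224), mode="bilinear", align_corners=False)
    image_feat = clip_model.encode_image(image_224)
    text_feat = clip_model.encode_text(clip.tokenize([prompt]).to(device))
    return 1.0 - F.cosine_similarity(image_feat, text_feat).mean()


# Usage sketch: optimize a (hypothetical) generator on a single input image.
# clip_model, _ = clip.load("ViT-B/32", device="cuda")
# edit_rgb, edit_alpha = generator(input_rgb)        # generator outputs color + opacity
# output = composite(edit_rgb, edit_alpha, input_rgb)
# loss = clip_text_loss(clip_model, output, "smoke coming out of the chimney")
```

Because the loss is applied to the composite (and, per the paper, directly to the edit layer), the optimization is constrained to produce a localized overlay rather than regenerating the whole image, which is what preserves fidelity to the input.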
Bibliography: O. Bar-Tal, D. Ofri-Amar, and R. Fridman have contributed equally.
ISBN: 3031197836; 9783031197833
ISSN: 0302-9743; 1611-3349
DOI: 10.1007/978-3-031-19784-0_41