Faster Image2Video Generation: A Closer Look at CLIP Image Embedding's Impact on Spatio-Temporal Cross-Attentions

This paper explores the effectiveness (specifically in improving video consistency) and the computational burden of Contrastive Language-Image Pre-Training (CLIP) embeddings in video generation. The investigation is conducted using the Stable Video Diffusion (SVD) framework, a state-of-the-art method...

Bibliographic Details
Published in	IEEE Access, Vol. 13, pp. 141313-141327
Main Authors	Taghipour, Ashkan; Ghahremani, Morteza; Bennamoun, Mohammed; Miri Rekavandi, Aref; Li, Zinuo; Laga, Hamid; Boussaid, Farid
Format	Journal Article
Language	English
Published	IEEE 2025
Subjects	Australia; CLIP image encoding; Computational modeling; Computer architecture; Diffusion models; image-to-video generation; Noise reduction; spatial cross-attention; temporal-cross-attention; Text to video; Three-dimensional displays; Training; Video generation; Videos; Visualization
