Dimension-Reduction Attack! Video Generative Models are Experts on Controllable Image Synthesis
Main Authors | , , , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 29.05.2025 |
DOI | 10.48550/arxiv.2505.23325 |
Summary: | Video generative models can be regarded as world simulators due to their
ability to capture dynamic, continuous changes inherent in real-world
environments. These models integrate high-dimensional information across
visual, temporal, spatial, and causal dimensions, enabling predictions of
subjects in various states. A natural and valuable research direction is to
explore whether a fully trained video generative model in high-dimensional
space can effectively support lower-dimensional tasks such as controllable
image generation. In this work, we propose a paradigm for video-to-image
knowledge compression and task adaptation, termed Dimension-Reduction
Attack (DRA-Ctrl), which utilizes the strengths of video models,
including long-range context modeling and flattened full attention, to perform
various generation tasks. Specifically, to address the challenging gap between
continuous video frames and discrete image generation, we introduce a
mixup-based transition strategy that ensures smooth adaptation. Moreover, we
redesign the attention structure with a tailored masking mechanism to better
align text prompts with image-level control. Experiments across diverse image
generation tasks, such as subject-driven and spatially conditioned generation,
show that repurposed video models outperform those trained directly on images.
These results highlight the untapped potential of large-scale video generators
for broader visual applications. DRA-Ctrl provides new insights into
reusing resource-intensive video models and lays a foundation for future unified
generative models across visual modalities. The project page is
https://dra-ctrl-2025.github.io/DRA-Ctrl/. |
---|---|