CV-VAE: A Compatible Video VAE for Latent Generative Video Models
Main Authors | Zhao, Sijie; Zhang, Yong; Cun, Xiaodong; Yang, Shaoshu; Niu, Muyao; Li, Xiaoyu; Hu, Wenbo; Shan, Ying |
Format | Journal Article (preprint) |
Language | English |
Published | 30.05.2024 |
Subjects | Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition |
DOI | 10.48550/arxiv.2405.20279 |
Copyright | http://creativecommons.org/licenses/by/4.0 |
Online Access | https://arxiv.org/abs/2405.20279 |
Abstract | Spatio-temporal compression of videos, using networks such as Variational
Autoencoders (VAEs), plays a crucial role in OpenAI's Sora and numerous other
video generative models. For instance, many LLM-like video models learn the
distribution of discrete tokens derived from 3D VAEs within the VQVAE
framework, while most diffusion-based video models capture the distribution of
continuous latents extracted by 2D VAEs without quantization. In the latter,
temporal compression is realized simply by uniform frame sampling, which
results in unsmooth motion between consecutive frames. The research community
currently lacks a commonly used continuous video (3D) VAE for latent
diffusion-based video models. Moreover, since current diffusion-based
approaches are often built on pre-trained text-to-image (T2I) models, directly
training a video VAE without considering compatibility with existing T2I
models produces a latent space gap between them, and bridging this gap
requires enormous computational resources during training, even when the T2I
models are used as initialization. To address this issue, we propose CV-VAE, a
method for training a video VAE for latent video models whose latent space is
compatible with that of a given image VAE, e.g., the image VAE of Stable
Diffusion (SD). Compatibility is achieved through a novel latent space
regularization, which formulates a regularization loss using the image VAE.
Benefiting from this latent space compatibility, video models can be trained
seamlessly from pre-trained T2I or video models in a truly spatio-temporally
compressed latent space, rather than one built by simply sampling video frames
at equal intervals. With our CV-VAE, existing video models can generate four
times more frames with minimal finetuning. Extensive experiments demonstrate
the effectiveness of the proposed video VAE. |
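A rough illustration of the latent space regularization described in the abstract: the sketch below encodes uniformly sampled key frames with a frozen pre-trained image VAE and penalizes the distance between those 2D latents and the video VAE's 3D latents. This is a minimal sketch under assumptions, not the paper's implementation: the PyTorch-style `encode` interface, the module names `video_vae` and `image_vae`, the temporal stride, and the MSE distance are all hypothetical.

```python
# Hypothetical sketch of latent space regularization: align the video VAE's
# latents with those of a frozen image VAE (e.g., Stable Diffusion's).
# Interfaces and shapes below are assumptions, not the authors' code.
import torch
import torch.nn.functional as F

def latent_regularization_loss(video, video_vae, image_vae, temporal_stride=4):
    """video: (B, C, T, H, W) pixel tensor; assumes the video VAE compresses
    time by `temporal_stride` and shares latent channels with the image VAE."""
    # 3D latents from the trainable video VAE: (B, c, T', h', w')
    z_video = video_vae.encode(video)

    # Pick the temporally aligned key frames and flatten them into a 2D batch.
    key_frames = video[:, :, ::temporal_stride]              # (B, C, T', H, W)
    b, c, t, h, w = key_frames.shape
    frames_2d = key_frames.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)

    # Encode with the frozen 2D image VAE; no gradients flow into it.
    with torch.no_grad():
        z_image = image_vae.encode(frames_2d)                # (B*T', c', h', w')
    z_image = z_image.reshape(b, t, *z_image.shape[1:]).permute(0, 2, 1, 3, 4)

    # Regularization loss: penalize the gap between the two latent spaces.
    return F.mse_loss(z_video, z_image)
```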