CV-VAE: A Compatible Video VAE for Latent Generative Video Models

Bibliographic Details
Main Authors Zhao, Sijie; Zhang, Yong; Cun, Xiaodong; Yang, Shaoshu; Niu, Muyao; Li, Xiaoyu; Hu, Wenbo; Shan, Ying
Format Journal Article
Language English
Published 30.05.2024
DOI 10.48550/arxiv.2405.20279
License http://creativecommons.org/licenses/by/4.0
Subjects Computer Science - Artificial Intelligence; Computer Science - Computer Vision and Pattern Recognition
Online Access https://arxiv.org/abs/2405.20279

Abstract Spatio-temporal compression of videos, using networks such as Variational Autoencoders (VAEs), plays a crucial role in OpenAI's SORA and numerous other video generative models. For instance, many LLM-like video models learn the distribution of discrete tokens derived from 3D VAEs within the VQVAE framework, while most diffusion-based video models capture the distribution of continuous latents extracted by 2D VAEs without quantization. In the latter case, temporal compression is realized simply by uniform frame sampling, which results in unsmooth motion between consecutive frames. The research community currently lacks a commonly used continuous video (3D) VAE for latent diffusion-based video models. Moreover, since current diffusion-based approaches are often built on pre-trained text-to-image (T2I) models, directly training a video VAE without considering compatibility with existing T2I models creates a latent space gap between the two, and bridging this gap demands huge computational resources for training even when the T2I models are used as initialization. To address this issue, we propose a method for training a video VAE for latent video models, namely CV-VAE, whose latent space is compatible with that of a given image VAE, e.g., the image VAE of Stable Diffusion (SD). The compatibility is achieved by the proposed novel latent space regularization, which formulates a regularization loss using the image VAE. Benefiting from this latent space compatibility, video models can be trained seamlessly from pre-trained T2I or video models in a truly spatio-temporally compressed latent space, rather than by simply sampling video frames at equal intervals. With our CV-VAE, existing video models can generate four times more frames with minimal finetuning. Extensive experiments demonstrate the effectiveness of the proposed video VAE.
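
The abstract does not give the exact form of the regularization loss, but the idea can be illustrated with a minimal, hypothetical sketch (this is not the authors' released code). A trainable 3D video VAE encoder is regularized so that its latents match those a frozen pre-trained image VAE, such as Stable Diffusion's, produces for the temporally aligned frames. The MSE objective, the encoder interfaces, and the temporal stride below are all assumptions for illustration.

import torch
import torch.nn.functional as F

def latent_regularization_loss(
    video_vae_encoder: torch.nn.Module,  # trainable 3D video VAE encoder (assumed interface)
    image_vae_encoder: torch.nn.Module,  # frozen 2D image VAE encoder, e.g. from SD (assumed interface)
    video: torch.Tensor,                 # (B, C, T, H, W) pixel clip; T divisible by temporal_stride
    temporal_stride: int = 4,            # assumed temporal compression factor of the video VAE
) -> torch.Tensor:
    # Encode the full clip with the 3D VAE: (B, c, T // stride, h, w).
    z_video = video_vae_encoder(video)

    # Encode the temporally aligned subset of frames with the frozen image VAE.
    frames = video[:, :, ::temporal_stride]                    # (B, C, T // stride, H, W)
    b, c, t, h, w = frames.shape
    frames_2d = frames.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)
    with torch.no_grad():                                      # image VAE stays frozen
        z_image = image_vae_encoder(frames_2d)                 # (B * t, c', h', w')
    z_image = z_image.reshape(b, t, *z_image.shape[1:]).permute(0, 2, 1, 3, 4)

    # Push the video latents toward the image-VAE latents so pre-trained
    # T2I or video models see a compatible latent distribution.
    return F.mse_loss(z_video, z_image)

In training, this term would presumably be added to the usual VAE reconstruction and KL losses with some weight; the weighting and the choice of distance are further assumptions not specified in the abstract.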