ConsistencyTTA: Accelerating Diffusion-Based Text-to-Audio Generation with Consistency Distillation
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 19.09.2023 |
Subjects | |
Summary: | Diffusion models are instrumental in text-to-audio (TTA) generation.
Unfortunately, they suffer from slow inference due to an excessive number of
queries to the underlying denoising network per generation. To address this
bottleneck, we introduce ConsistencyTTA, a framework requiring only a single
non-autoregressive network query, thereby accelerating TTA by hundreds of
times. We achieve this by proposing a "CFG-aware latent consistency model," which
adapts consistency generation into a latent space and incorporates
classifier-free guidance (CFG) into model training. Moreover, unlike diffusion
models, ConsistencyTTA can be finetuned closed-loop with audio-space text-aware
metrics, such as CLAP score, to further enhance the generations. Our objective
and subjective evaluation on the AudioCaps dataset shows that compared to
diffusion-based counterparts, ConsistencyTTA reduces inference computation by
400x while retaining generation quality and diversity. |
---|---|
DOI: | 10.48550/arxiv.2309.10740 |
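
The summary's cost argument can be illustrated with a minimal sketch. It assumes the standard classifier-free guidance (CFG) formulation, in which each sampling step queries the denoising network twice (once with the text prompt, once without) and combines the two estimates. The function names, the guidance weight `w`, and the 200-step baseline are illustrative assumptions and are not stated in this record.

```python
def cfg_combine(eps_cond, eps_uncond, w):
    """Standard CFG combination (assumed form): move from the unconditional
    estimate toward the text-conditioned one with guidance weight w."""
    return [w * c + (1.0 - w) * u for c, u in zip(eps_cond, eps_uncond)]


def queries_per_generation(num_steps, external_cfg):
    """Count denoising-network evaluations for one generated clip.

    With CFG applied outside the network, every step costs two queries
    (conditioned and unconditioned); otherwise one query per step.
    """
    return num_steps * (2 if external_cfg else 1)


# Hypothetical baseline: a 200-step diffusion sampler with external CFG.
diffusion_queries = queries_per_generation(num_steps=200, external_cfg=True)

# A single-step model with guidance folded into training needs one query.
consistency_queries = queries_per_generation(num_steps=1, external_cfg=False)

query_ratio = diffusion_queries / consistency_queries
```

Under these assumed settings the query ratio comes out to 400, the same order as the reduction the summary reports. The actual baseline configuration should be taken from the full text.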