Audio-Journey: Open Domain Latent Diffusion Based Text-To-Audio Generation

Bibliographic Details
Published in	ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 6960 - 6964
Main Authors	Michaels, Jackson; Li, Juncheng B; Yao, Laura; Yu, Lijun; Wood-Doughty, Zach; Metze, Florian
Format	Conference Proceeding
Language	English
Published	IEEE 14.04.2024
Subjects	Audio-Visual Training; Data augmentation; Deep Learning; Encoding; Large Language Models; Machine learning; Open Domain Audio Generation; Schedules; Semantics; Signal processing; Technological innovation

Summary:	Despite recent progress, machine learning (ML) models for open-domain audio generation need to catch up to generative models for image, text, speech, and music. The lack of massive open-domain audio datasets is the main reason for this performance gap; we overcome this challenge through a novel data augmentation approach. We leverage state-of-the-art (SOTA) Large Language Models (LLMs) to enrich captions in the weakly-labeled audio dataset. We then use a SOTA video-captioning model to generate captions for the videos from which the audio data originated, and we again use LLMs to merge the audio and video captions to form a rich, large-scale dataset. We experimentally evaluate the quality of our audio-visual captions, showing a 12.5% gain in semantic score over baselines. Using our augmented dataset, we train a Latent Diffusion Model to generate audio in an EnCodec encoding latent space. Our model is novel in the current SOTA audio generation landscape due to our generation space, text encoder, noise schedule, and attention mechanism. Together, these innovations provide competitive open-domain audio generation. The samples, models, and implementation will be available at https://audiojourney.github.io.
ISSN:	2379-190X
DOI:	10.1109/ICASSP48485.2024.10448220