Stack-and-Delay: A New Codebook Pattern for Music Generation


Bibliographic Details
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 796-800
Main Authors: Le Lan, Gael; Nagaraja, Varun; Chang, Ernie; Kant, David; Ni, Zhaoheng; Shi, Yangyang; Iandola, Forrest; Chandra, Vikas
Format: Conference Proceeding
Language: English
Published: IEEE, 14.04.2024

Summary: Language-modeling-based music generation relies on discrete representations of audio frames. An audio frame (e.g. 20 ms) is typically represented by a set of discrete codes (e.g. 4) computed by a neural codec. Autoregressive decoding typically generates a few thousand codes per song, which is prohibitively slow and motivates some form of parallel decoding. In this paper we compare decoding strategies to understand which codes can be decoded in parallel without penalizing quality too much. We propose a novel stack-and-delay style of decoding that improves upon vanilla (flattened-codes) decoding with a four-fold inference speedup, bringing inference speed close to that of the previous state of the art (the delay strategy). For the same inference-efficiency budget, the proposed approach outperforms the delay strategy in objective evaluations, almost closing the quality gap with vanilla decoding. The results are supported by spectral analysis and listening tests, which demonstrate that samples produced by the new model exhibit improved high-frequency rendering and better preservation of harmonics and rhythm patterns.
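
As context for the abstract, here is a minimal sketch (an illustration under stated assumptions, not code from the paper) of the two baseline codebook patterns it contrasts: the flattened ("vanilla") pattern, which serializes all K codes of each frame into one long sequence, and the delay pattern, which offsets codebook k by k steps so that the K codes of a step can be predicted in parallel. The function names and padding convention are hypothetical.

```python
import numpy as np

def flatten_pattern(codes):
    """Vanilla (flattened) pattern: serialize a (K, T) grid of codec codes
    frame by frame, yielding K * T sequential autoregressive steps."""
    K, T = codes.shape
    # Frame-major order: c[0,0], c[1,0], ..., c[K-1,0], c[0,1], ...
    return codes.T.reshape(K * T)

def delay_pattern(codes, pad=-1):
    """Delay pattern: shift codebook k right by k steps so the model can
    predict K codes in parallel at each step; only T + K - 1 steps needed."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

if __name__ == "__main__":
    K, T = 4, 6  # e.g. 4 codebooks, 6 audio frames
    codes = np.arange(K * T).reshape(K, T)
    print(flatten_pattern(codes).shape)  # (24,)  -> 24 sequential steps
    print(delay_pattern(codes).shape)    # (4, 9) -> 9 parallel steps
```

The proposed stack-and-delay pattern combines elements of both layouts to trade sequential steps for parallel ones; its exact arrangement is detailed in the paper.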
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447392