Stack-and-Delay: A New Codebook Pattern for Music Generation


Bibliographic Details
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 796-800
Main Authors: Le Lan, Gael; Nagaraja, Varun; Chang, Ernie; Kant, David; Ni, Zhaoheng; Shi, Yangyang; Iandola, Forrest; Chandra, Vikas
Format: Conference Proceeding
Language: English
Published: IEEE, 14.04.2024

Summary: Language-modeling-based music generation relies on discrete representations of audio frames. An audio frame (e.g. 20 ms) is typically represented by a set of discrete codes (e.g. 4) computed by a neural codec. Autoregressive decoding typically generates a few thousand codes per song, which is prohibitively slow and motivates some form of parallel decoding. In this paper we compare decoding strategies to understand which codes can be decoded in parallel without penalizing quality too much. We propose a novel stack-and-delay style of decoding that improves upon vanilla (flattened-codes) decoding with a four-fold inference speedup, bringing inference speed close to that of the previous state of the art (the delay strategy). For the same inference-efficiency budget, the proposed approach outperforms the delay strategy in objective evaluations, almost closing the quality gap with vanilla decoding. The results are supported by spectral analysis and listening tests, which demonstrate that samples produced by the new model exhibit improved high-frequency rendering and better preservation of harmonics and rhythm patterns.
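
As context for the abstract, here is a minimal sketch (an illustration under stated assumptions, not code from the paper) of the two baseline codebook patterns it contrasts: the flattened ("vanilla") pattern, which serializes all K codes of each frame into one long sequence, and the delay pattern, which offsets codebook k by k steps so that the K codes of a step can be predicted in parallel. The function names and padding convention are hypothetical.

```python
import numpy as np

def flatten_pattern(codes):
    """Vanilla (flattened) pattern: serialize a (K, T) grid of codec codes
    frame by frame, yielding K * T sequential autoregressive steps."""
    K, T = codes.shape
    # Frame-major order: c[0,0], c[1,0], ..., c[K-1,0], c[0,1], ...
    return codes.T.reshape(K * T)

def delay_pattern(codes, pad=-1):
    """Delay pattern: shift codebook k right by k steps so the model can
    predict K codes in parallel at each step; only T + K - 1 steps needed."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

if __name__ == "__main__":
    K, T = 4, 6  # e.g. 4 codebooks, 6 audio frames
    codes = np.arange(K * T).reshape(K, T)
    print(flatten_pattern(codes).shape)  # (24,)  -> 24 sequential steps
    print(delay_pattern(codes).shape)    # (4, 9) -> 9 parallel steps
```

The proposed stack-and-delay pattern combines elements of both layouts to trade sequential steps for parallel ones; its exact arrangement is detailed in the paper.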
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447392