Stack-and-Delay: A New Codebook Pattern for Music Generation
Published in: ICASSP 2024 - 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp. 796-800
Format: Conference Proceeding
Language: English
Published: IEEE, 14.04.2024
Summary: Language modeling based music generation relies on discrete representations of audio frames. An audio frame (e.g. 20 ms) is typically represented by a set of discrete codes (e.g. 4) computed by a neural codec. Autoregressive decoding typically generates a few thousand codes per song, which is prohibitively slow and calls for some form of parallel decoding. In this paper we compare decoding strategies that aim to understand which codes can be decoded in parallel without penalizing quality too much. We propose a novel stack-and-delay style of decoding that improves upon vanilla (flattened codes) decoding with a four-fold inference speedup, bringing inference speed close to that of the previous state of the art (the delay strategy). For the same inference-efficiency budget, the proposed approach outperforms the delay strategy in objective evaluations, almost closing the quality gap with vanilla decoding. The results are supported by spectral analysis and listening tests, which show that samples produced by the new model exhibit improved high-frequency rendering and better preservation of harmonics and rhythm patterns.
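The contrast the summary draws between flattened ("vanilla") decoding and the delay pattern can be sketched as follows. This is an illustrative sketch only: the function names and the toy shapes (K=2 codebooks, T=4 frames, `-1` as a padding token) are assumptions, and it shows the generic delay interleaving, not the paper's stack-and-delay variant.

```python
import numpy as np

def flatten_pattern(codes):
    """'Vanilla' flattened order: emit all K codes of frame t before
    any code of frame t+1, giving K*T fully sequential steps."""
    K, T = codes.shape
    return [codes[k, t] for t in range(T) for k in range(K)]

def delay_pattern(codes, pad=-1):
    """Delay pattern: codebook k is shifted right by k positions, so
    each decoding step emits one token per codebook in parallel.
    Sequential length drops from K*T to T + K - 1 steps (columns)."""
    K, T = codes.shape
    out = np.full((K, T + K - 1), pad, dtype=codes.dtype)
    for k in range(K):
        out[k, k:k + T] = codes[k]
    return out

codes = np.arange(8).reshape(2, 4)  # toy codes: K=2 codebooks, T=4 frames
print(flatten_pattern(codes))       # 8 sequential decoding steps
print(delay_pattern(codes))         # 5 columns, decoded one column per step
```

With 4 codebooks at a 50 Hz frame rate, the flattened order needs 4x as many autoregressive steps as there are frames, which is the slowdown the delay-style patterns are designed to avoid.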
ISSN: 2379-190X
DOI: 10.1109/ICASSP48485.2024.10447392