Constructing a Singing Style Caption Dataset
Format | Journal Article
---|---
Language | English
Published | 15.09.2024
Summary: Singing voice synthesis and conversion have emerged as significant
subdomains of voice generation, leading to increasing demand for
prompt-conditioned generation.
Unlike common voice data, generating a singing voice requires an understanding
of various associated vocal and musical characteristics, such as the vocal tone
of the singer or emotional expressions. However, existing open-source
audio-text datasets for voice generation tend to capture only a very limited
range of attributes, often missing musical characteristics of the audio. To
fill this gap, we introduce S2Cap, an audio-text pair dataset with a diverse
set of attributes. S2Cap consists of pairs of textual prompts and music audio
samples with a wide range of vocal and musical attributes, including pitch,
volume, tempo, mood, singer's gender and age, and musical genre and emotional
expression. Utilizing S2Cap, we propose an effective novel baseline algorithm
for singing style captioning, a task we introduce here: related to voice
generation, it produces textual descriptions of vocal characteristics. First,
to mitigate the misalignment between the audio
encoder and the text decoder, we present a novel mechanism called CRESCENDO,
which utilizes positive-pair similarity learning to align the embedding
space of a pretrained audio encoder with that of a text encoder. We
additionally supervise the model using the singer's voice, which is
demixed from the accompaniment. This supervision allows the model to more
accurately capture vocal characteristics, leading to improved singing style
captions that better reflect the style of the singer. The dataset and the
code are available at https://github.com/HJ-Ok/S2cap.
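The positive-pair similarity learning mentioned for CRESCENDO can be sketched as follows. This is a minimal illustration under assumptions: the function names are hypothetical, and the loss shown is a plain cosine-similarity objective over aligned audio/text embedding pairs, not the authors' actual CRESCENDO implementation.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Normalize embeddings to unit length so dot products become cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def positive_pair_loss(audio_emb, text_emb):
    """Mean (1 - cosine similarity) over aligned audio/text pairs.

    Minimizing this pulls each audio embedding toward the embedding of its
    paired text description, aligning the two embedding spaces.
    """
    a = l2_normalize(audio_emb)
    t = l2_normalize(text_emb)
    cos = np.sum(a * t, axis=-1)  # per-pair cosine similarity
    return float(np.mean(1.0 - cos))

# Toy batch: 4 pairs of 8-dimensional embeddings.
rng = np.random.default_rng(0)
audio = rng.normal(size=(4, 8))
text = rng.normal(size=(4, 8))
loss_random = positive_pair_loss(audio, text)    # positive for unrelated pairs
loss_aligned = positive_pair_loss(audio, audio)  # near zero when spaces already match
```

In practice such a loss is computed on encoder outputs during training; the sketch only shows the objective itself, which is bounded in [0, 2] because cosine similarity lies in [-1, 1].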
DOI: 10.48550/arxiv.2409.09866