FastSVC: Fast Cross-Domain Singing Voice Conversion with Feature-wise Linear Modulation
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 11.11.2020 |
Summary: | This paper presents FastSVC, a lightweight cross-domain singing voice
conversion (SVC) system that achieves high conversion performance with
inference 4x faster than real time on CPUs. FastSVC uses a Conformer-based
phoneme recognizer to extract singer-agnostic linguistic features from singing
signals. A generator based on feature-wise linear modulation (FiLM) synthesizes
the waveform directly from the linguistic features, leveraging information from
sine-excitation signals and loudness features. The waveform generator can be
trained conveniently using a multi-resolution spectral loss and an adversarial
loss. Experimental results show that, compared with a computationally heavy
baseline system, FastSVC achieves comparable conversion performance in some
scenarios and significantly better conversion performance in others. Moreover,
FastSVC achieves desirable cross-lingual singing conversion performance.
Inference with FastSVC is 3x faster than the baseline system on GPUs and 70x
faster on CPUs. |
---|---|
DOI: | 10.48550/arxiv.2011.05731 |
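
The summary names feature-wise linear modulation as the mechanism by which the generator uses its conditioning signals. As a rough illustration of the general technique only, and not of the authors' architecture, the sketch below scales and shifts each channel of a hidden feature map with values derived from a conditioning input. The function names, the linear conditioning map, the array shapes, and the choice of a two-row conditioning input (standing in for a sine-excitation signal and a loudness feature) are all assumptions made for this example.

```python
import numpy as np


def film(features, gamma, beta):
    """Feature-wise linear modulation: gamma * features + beta.

    features: array of shape (channels, time)
    gamma, beta: arrays broadcastable to features, e.g. (channels, 1)
                 or (channels, time)
    """
    features = np.asarray(features, dtype=float)
    gamma = np.asarray(gamma, dtype=float)
    beta = np.asarray(beta, dtype=float)
    return gamma * features + beta


def film_parameters(condition, w_gamma, w_beta):
    """Hypothetical conditioning network: one linear map per parameter.

    condition: array of shape (cond_dim, time)
    w_gamma, w_beta: arrays of shape (channels, cond_dim)
    Returns per-channel, per-frame scale and shift, each (channels, time).
    """
    return w_gamma @ condition, w_beta @ condition


# Toy usage with made-up sizes: 4 hidden channels, 6 time steps, and
# 2 conditioning rows (assumed here: sine excitation and loudness).
rng = np.random.default_rng(0)
hidden = rng.standard_normal((4, 6))
condition = rng.standard_normal((2, 6))
w_gamma = rng.standard_normal((4, 2))
w_beta = rng.standard_normal((4, 2))

gamma, beta = film_parameters(condition, w_gamma, w_beta)
modulated = film(hidden, gamma, beta)
print(modulated.shape)
```

In a trained model the scale and shift would come from learned layers rather than random matrices; the random weights here only demonstrate the shapes involved.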