Pitchtron: Towards audiobook generation from ordinary people's voices
Format | Journal Article
Language | English
Published | 21.05.2020
Summary: In this paper, we explore prosody transfer for audiobook generation under a rather realistic condition: the training database consists of plain audio, mostly from multiple ordinary speakers, while the reference audio given at inference comes from a professional speaker and is richer in prosody than the training data. Specifically, we explore transferring Korean dialects and emotive speech even though the training set is mostly composed of standard, neutral Korean. We found that, under this setting, the original global style token (GST) method generates undesirable glitches in pitch, energy, and pause length. To deal with this issue, we propose two models, hard and soft pitchtron, and release the toolkit and corpus that we have developed. Hard pitchtron uses pitch as an input to the decoder, while soft pitchtron uses pitch as an input to the prosody encoder. We verify the effectiveness of the proposed models with objective and subjective tests. The AXY score over GST is 2.01 for hard pitchtron and 1.14 for soft pitchtron.
DOI: 10.48550/arxiv.2005.10456
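The abstract's distinction between the two variants is where the frame-level pitch (F0) sequence enters the pipeline. A minimal sketch of that difference, assuming a generic Tacotron-style setup with list-based "tensors" and a toy mean-pool prosody encoder (all function names and shapes here are illustrative assumptions, not the paper's implementation):

```python
# Toy sketch of the two conditioning schemes described in the abstract.
# Frames are plain Python lists; the "prosody encoder" is a mean-pool.

def mean_pool(frames):
    """Collapse a variable-length frame sequence into one fixed-size vector."""
    dim = len(frames[0])
    return [sum(f[i] for f in frames) / len(frames) for i in range(dim)]

def soft_pitchtron_embedding(ref_mel, ref_pitch):
    """Soft pitchtron: pitch is an input to the *prosody encoder*.
    Each reference frame is its mel vector concatenated with its F0 value,
    and the encoder collapses the joint sequence into one embedding."""
    joint = [mel + [f0] for mel, f0 in zip(ref_mel, ref_pitch)]
    return mean_pool(joint)

def hard_pitchtron_decoder_inputs(frame_encodings, pitch):
    """Hard pitchtron: pitch is an input to the *decoder*, so every
    decoder step sees its context vector plus an explicit F0 value."""
    return [enc + [f0] for enc, f0 in zip(frame_encodings, pitch)]

# Two reference frames with 3-dim "mel" features and their F0 values (Hz).
ref_mel = [[1.0, 2.0, 3.0], [3.0, 2.0, 1.0]]
ref_pitch = [120.0, 180.0]

emb = soft_pitchtron_embedding(ref_mel, ref_pitch)       # → [2.0, 2.0, 2.0, 150.0]
dec = hard_pitchtron_decoder_inputs(ref_mel, ref_pitch)  # per-step decoder inputs
```

In the soft variant the pitch is summarized away into a single prosody embedding, while in the hard variant it stays explicit at every decoder step, which is consistent with the abstract's claim that the hard model tracks the reference more closely (higher AXY score).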