End-to-end Lyrics Alignment for Polyphonic Music Using an Audio-to-character Recognition Model

Bibliographic Details
Published in	ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) pp. 181 - 185
Main Authors	Stoller, Daniel; Durand, Simon; Ewert, Sebastian
Format	Conference Proceeding
Language	English
Published	IEEE 01.05.2019
Subjects	CTC training; Lyrics alignment; Lyrics transcription; Multi-scale representation; Neural networks
Summary:	Time-aligned lyrics can enrich the music listening experience by enabling applications such as karaoke, text-based song retrieval, and intra-song navigation. Compared to text-to-speech alignment, lyrics alignment remains highly challenging, despite many attempts to break the problem down by combining numerous sub-modules, including vocal separation and detection. Furthermore, training has required fine-grained annotations to be available in some form. Here, we present a novel system based on a modified Wave-U-Net architecture, which predicts character probabilities directly from raw audio using learnt multi-scale representations of the various signal components. There are no sub-modules whose interdependencies need to be optimized. Our training procedure is designed to work with weak, line-level annotations available in the real world. With a mean alignment error of 0.35 s on a standard dataset, our system outperforms the state of the art by an order of magnitude.
ISSN:	2379-190X
DOI:	10.1109/ICASSP.2019.8683470
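
Note on the reported metric: the summary quotes a mean alignment error in seconds. As a minimal sketch, and assuming this means the mean absolute deviation between predicted and reference word onset times (the record itself does not define it), the computation could look as follows. The function name and the onset values are hypothetical and serve only as illustration; they are not taken from the paper.

```python
from statistics import mean


def mean_alignment_error(predicted, reference):
    """Mean absolute deviation, in seconds, between paired onset times.

    Assumption: both lists hold one onset time per word, in the same order.
    """
    if len(predicted) != len(reference):
        raise ValueError("onset lists must have the same length")
    if not predicted:
        raise ValueError("onset lists must not be empty")
    return mean(abs(p - r) for p, r in zip(predicted, reference))


# Hypothetical word onset times in seconds, for illustration only.
reference_onsets = [1.0, 2.5, 4.0, 6.0]
predicted_onsets = [1.5, 2.0, 4.25, 6.25]

error = mean_alignment_error(predicted_onsets, reference_onsets)
print(f"mean alignment error: {error:.3f} s")
```

Evaluations of this kind often average the error per song first and then across songs; the sketch above covers only a single song.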