Improving Target Sound Extraction with Timestamp Information
Main Authors | , , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 02.04.2022 |
Subjects | |
Summary: | Target sound extraction (TSE) aims to extract the part of a mixture
that belongs to a target sound event class when the mixture contains multiple
sound events. Previous work has mainly addressed weakly labelled data,
joint learning, and new classes, but has not considered the onset and
offset times of the target sound event, which are emphasized in
auditory scene analysis. In this paper, we study how to use such timestamp
information to help extract the target sound via a target sound detection
network and a target-weighted time-frequency loss function.
More specifically, we use the detection result of a target sound detection
(TSD) network as additional information to guide the learning of the target
sound extraction network. We also find that the result of TSE can further
improve the performance of the TSD network, so we propose a mutual learning
framework for target sound detection and extraction. In addition, a
target-weighted time-frequency loss function is designed to pay more attention
to the temporal regions of the target sound during training. Experimental
results on synthesized data generated from the Freesound Datasets show that
our proposed method can significantly improve the performance of TSE. |
---|---|
DOI: | 10.48550/arxiv.2204.00821 |
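
The summary describes a target-weighted time-frequency loss that pays more attention to the temporal regions where the target sound is active. The record does not give the paper's formula, so the following is only a minimal sketch under stated assumptions: the loss is taken to be a weighted mean absolute error on magnitude spectrograms, and the function names, the `target_weight` parameter, and its default value are hypothetical, not taken from the paper.

```python
import numpy as np


def frames_from_timestamps(onset_s, offset_s, n_frames, hop_s):
    """Return a boolean mask of frames whose start time lies in [onset_s, offset_s).

    Assumption: frame i starts at i * hop_s seconds.
    """
    starts = np.arange(n_frames) * hop_s
    return (starts >= onset_s) & (starts < offset_s)


def target_weighted_tf_loss(est_mag, ref_mag, target_frames, target_weight=2.0):
    """Weighted mean absolute error over (n_freq, n_frames) magnitude spectrograms.

    Frames marked True in target_frames get weight `target_weight`
    (hypothetical value); all other frames get weight 1.
    """
    est_mag = np.asarray(est_mag, dtype=float)
    ref_mag = np.asarray(ref_mag, dtype=float)
    active = np.asarray(target_frames, dtype=bool)
    if est_mag.shape != ref_mag.shape or active.shape != (est_mag.shape[1],):
        raise ValueError("spectrogram and frame-mask shapes do not match")

    # One weight per frame, shared by every frequency bin in that frame.
    frame_weights = np.where(active, target_weight, 1.0)
    weights = np.broadcast_to(frame_weights, est_mag.shape)

    # Normalize by the total weight so the loss stays on the scale of the error.
    err = np.abs(est_mag - ref_mag)
    return float((weights * err).sum() / weights.sum())


if __name__ == "__main__":
    mask = frames_from_timestamps(onset_s=0.5, offset_s=1.0, n_frames=4, hop_s=0.5)
    ref = np.zeros((2, 4))
    est = np.zeros((2, 4))
    est[:, 1] = 1.0  # error placed inside the target-active frame
    print(target_weighted_tf_loss(est, ref, mask, target_weight=3.0))
```

With this weighting, the same error costs more when it falls between the target's onset and offset than when it falls elsewhere, which is the behavior the summary attributes to the loss. In the mutual learning setup the summary describes, the frame mask could plausibly come from the TSD network's output rather than from ground-truth timestamps, but that wiring is not shown here.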