An Empirical Analysis on the Vulnerabilities of End-to-End Speech Segregation Models
Main Authors | , , , |
---|---|
Format | Journal Article |
Language | English |
Published | 19.06.2022 |
Subjects | |
Online Access | Get full text |
Summary: | End-to-end learning models have demonstrated a remarkable capability in
performing speech segregation. Despite their wide scope of real-world
applications, little is known about the mechanisms they employ to group and
consequently segregate individual speakers. Knowing that harmonicity is a
critical cue for these networks to group sources, in this work, we perform a
thorough investigation of ConvTasnet and DPT-Net to analyze how they perform a
harmonic analysis of the input mixture. We perform ablation studies where we
apply low-pass, high-pass, and band-stop filters of varying pass-bands to
empirically analyze the harmonics most critical for segregation. We also
investigate how these networks decide which output channel to assign to an
estimated source by introducing discontinuities in synthetic mixtures. We find
that end-to-end networks are highly unstable, and perform poorly when
confronted with deformations which are imperceptible to humans. Replacing the
encoder in these networks with a spectrogram leads to lower overall
performance, but much higher stability. This work helps us to understand what
information these networks rely on for speech segregation, and exposes two
sources of generalization errors. It also pinpoints the encoder as the part of
the network responsible for these errors, allowing for a redesign with expert
knowledge or transfer learning. |
---|---|
DOI: | 10.48550/arxiv.2206.09556 |
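
The filtering ablation mentioned in the summary (removing selected frequency bands from the input before segregation) can be illustrated with a minimal sketch. This is not the authors' code: the record does not state which filter design they used, so an idealized FFT-mask band-stop filter is swapped in here, and the function names (`harmonic_complex`, `band_stop_fft`, `band_energy`), the sample rate, and the test tone are all hypothetical choices made for illustration.

```python
import numpy as np


def harmonic_complex(f0, n_harmonics, sr, duration):
    """Synthesize a toy harmonic signal: equal-amplitude sines at f0, 2*f0, ..."""
    t = np.arange(int(sr * duration)) / sr
    signal = np.zeros_like(t)
    for k in range(1, n_harmonics + 1):
        signal += np.sin(2 * np.pi * f0 * k * t)
    return signal / n_harmonics


def band_stop_fft(signal, sr, low_hz, high_hz):
    """Idealized band-stop filter (assumption, not the paper's filter):
    zero every FFT bin whose frequency lies in [low_hz, high_hz]."""
    spectrum = np.fft.rfft(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    spectrum[(freqs >= low_hz) & (freqs <= high_hz)] = 0.0
    return np.fft.irfft(spectrum, n=len(signal))


def band_energy(signal, sr, low_hz, high_hz):
    """Sum of squared FFT magnitudes inside [low_hz, high_hz]."""
    power = np.abs(np.fft.rfft(signal)) ** 2
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(power[(freqs >= low_hz) & (freqs <= high_hz)].sum())


if __name__ == "__main__":
    sr = 8000  # hypothetical sample rate in Hz
    x = harmonic_complex(f0=200.0, n_harmonics=10, sr=sr, duration=1.0)
    # Remove the harmonics at 600 Hz and 800 Hz, keep the rest.
    y = band_stop_fft(x, sr, low_hz=500.0, high_hz=900.0)
    print("stop-band energy before:", band_energy(x, sr, 500.0, 900.0))
    print("stop-band energy after: ", band_energy(y, sr, 500.0, 900.0))
```

Under the same assumptions, a low-pass filter is the special case `band_stop_fft(x, sr, cutoff, sr / 2)` and a high-pass filter is `band_stop_fft(x, sr, 0.0, cutoff)`. In an ablation of the kind the summary describes, the filtered mixture would then be fed to the segregation model and the output quality compared against the unfiltered baseline.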