Adversarial Continual Learning to Transfer Self-Supervised Speech Representations for Voice Pathology Detection

Bibliographic Details
Published in	IEEE signal processing letters Vol. 30; pp. 1 - 5
Main Authors	Park, Dongkeon; Yu, Yechan; Katabi, Dina; Kim, Hong Kook
Format	Journal Article
Language	English
Published	New York: IEEE, 01.01.2023; The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Adaptation models; Adversarial regularization; Artificial neural networks; Context modeling; continual learning; Data models; Feature extraction; fine-tuning; Learning; Pathology; Performance enhancement; Regularization; Representations; self-supervised pretrained model; Speech; Support vector machines; Task analysis; voice pathology detection; Voice recognition; wav2vec 2.0

More Information
Summary:	In recent years, voice pathology detection (VPD) has received considerable attention because of the increasing risk of voice problems. Several methods, such as support vector machine (SVM) and convolutional neural network-based models, achieve good VPD performance. To further improve the performance, we use a self-supervised pretrained model as the feature representation instead of explicit speech features. When the pretrained model is fine-tuned for VPD, an overfitting problem occurs due to a domain shift from conversational speech to the VPD task. To mitigate this problem, we propose an adversarial task adaptive pretraining (A-TAPT) approach that incorporates adversarial regularization during the continual learning process. Experiments on VPD using the Saarbrücken Voice Database show that the proposed A-TAPT improves the unweighted average recall (UAR) by 12.36% and 15.38% absolute compared with SVM and ResNet50, respectively. The proposed A-TAPT also achieves a UAR that is 2.77% higher than that of conventional TAPT learning.
ISSN:	1070-9908 1558-2361
DOI:	10.1109/LSP.2023.3298532
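
Note on the metric: the summary reports results as unweighted average recall (UAR), which is the plain mean of the per-class recalls, so a small class counts as much as a large one. The sketch below is only an illustration of that standard definition, not the authors' evaluation code; the function name and the toy labels (1 = pathological, 0 = healthy) are assumptions made up for this example.

```python
from collections import defaultdict


def unweighted_average_recall(y_true, y_pred):
    """Return the mean of per-class recalls (each class weighted equally)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if not y_true:
        raise ValueError("at least one sample is required")

    total = defaultdict(int)    # samples per true class
    correct = defaultdict(int)  # correctly predicted samples per true class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1

    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Hypothetical, imbalanced toy data: six pathological (1), two healthy (0).
y_true = [1, 1, 1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 1, 1, 1, 1, 0]

uar = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"UAR = {uar:.3f}, accuracy = {accuracy:.3f}")
```

On this toy data the recall is 6/6 for class 1 and 1/2 for class 0, so the UAR is 0.75 while plain accuracy is 0.875; the gap shows why UAR is often preferred when classes are imbalanced.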