Dysarthric Speech Recognition Using Pseudo-Labeling, Self-Supervised Feature Learning, and a Joint Multi-Task Learning Approach

Bibliographic Details
Published in	IEEE Access, Vol. 12, pp. 36990-36999
Main Authors	Takashima, Ryoichi; Sawa, Yuya; Aihara, Ryo; Takiguchi, Tetsuya; Imai, Yoshie
Format	Journal Article
Language	English
Published	Piscataway: The Institute of Electrical and Electronics Engineers, Inc. (IEEE), 01.01.2024
Subjects	Automatic speech recognition; Data models; Dysarthria; Errors; Japanese language; Labeling; Labels; Machine learning; Multitasking; Pseudo-labeling; Representation learning; Self-supervised feature learning; Self-supervised learning; Speech; Speech recognition; Spontaneous speech; Task analysis; Training; Voice recognition
ISSN	2169-3536
DOI	10.1109/ACCESS.2024.3374874

Summary:	In this paper, we investigate using the spontaneous speech of people with dysarthria to train an automatic speech recognition (ASR) model for them. Spontaneous speech is relatively easy to collect compared with script-reading speech, which is obtained by having speakers read a prepared script, but labeling it is difficult and costly. Pseudo-labeling and self-supervised feature learning have been studied as effective ways to train an ASR model on unlabeled speech data; however, their effectiveness on unlabeled dysarthric speech has not been established. Pseudo-labeling in particular may be ineffective, because the pseudo-labels of dysarthric speech contain many errors and are unreliable. We evaluate both approaches for dysarthric speech recognition and propose a multi-task learning approach that combines them to train an ASR model that is robust to errors in the pseudo-labels. Experiments on Japanese and English datasets showed that all of these approaches are effective and that the proposed multi-task learning approach performed best.
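The summary does not give the paper's loss formulation, so the following is only a minimal sketch of one common way to combine a pseudo-label objective with a label-free objective in multi-task training. Every name here (`pseudo_label_loss`, `reconstruction_loss`, `joint_loss`, `ssl_weight`) is hypothetical, and the mean-squared-error term merely stands in for whatever self-supervised objective is actually used.

```python
import math


def pseudo_label_loss(frame_probs, pseudo_labels):
    """Mean cross-entropy of per-frame class probabilities against pseudo-labels.

    The pseudo-labels may be wrong; this term treats them as if they were true.
    """
    losses = [-math.log(max(probs[label], 1e-12))
              for probs, label in zip(frame_probs, pseudo_labels)]
    return sum(losses) / len(losses)


def reconstruction_loss(predicted, target):
    """Mean squared error between predicted and target features.

    Assumption: a stand-in for a self-supervised objective that needs no labels.
    """
    squared = [(p - t) ** 2 for p, t in zip(predicted, target)]
    return sum(squared) / len(squared)


def joint_loss(frame_probs, pseudo_labels, predicted, target, ssl_weight=0.5):
    """Weighted sum of both objectives.

    ssl_weight is a hypothetical hyperparameter; a larger value reduces how
    much the noisy pseudo-labels influence training.
    """
    supervised = pseudo_label_loss(frame_probs, pseudo_labels)
    self_supervised = reconstruction_loss(predicted, target)
    return (1.0 - ssl_weight) * supervised + ssl_weight * self_supervised


if __name__ == "__main__":
    probs = [[0.5, 0.5], [0.25, 0.75]]   # toy per-frame class probabilities
    labels = [0, 1]                      # toy pseudo-labels
    print(joint_loss(probs, labels, [1.0, 2.0], [1.0, 4.0], ssl_weight=0.5))
```

Under this assumed form, the label-free term still provides a training signal on utterances whose pseudo-labels are wrong, which is one plausible reading of the robustness claimed in the summary.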