Automatic GRBAS Scoring of Pathological Voices using Deep Learning and a Small Set of Labeled Voice Data

Bibliographic Details
Published in: Journal of Voice
Main Authors: Hidaka, Shunsuke; Lee, Yogaku; Nakanishi, Moe; Wakamiya, Kohei; Nakagawa, Takashi; Kaburagi, Tokihiko
Format: Journal Article
Language: English
Published: United States, 24.11.2022
Summary: Auditory-perceptual evaluation frameworks, such as the grade-roughness-breathiness-asthenia-strain (GRBAS) scale, are the gold standard for the quantitative evaluation of pathological voice quality. However, the evaluation is subjective; thus, the ratings lack reproducibility due to inter- and intra-rater variation. Prior researchers have proposed deep-learning-based automatic GRBAS score estimation to address this problem. However, these methods require large amounts of labeled voice data. Therefore, this study investigates the potential of automatic GRBAS estimation using deep learning with smaller amounts of data. A dataset consisting of 300 pathological sustained /a/ vowel samples was created and rated by eight experts (200 for training, 50 for validation, and 50 for testing). A neural network model that predicts the probability distribution of GRBAS scores from an onset-to-offset waveform was proposed. Random speed perturbation, random crop, and frequency masking were investigated as data augmentation techniques, and power, instantaneous frequency, and group delay were investigated as time-frequency representations. Five-fold cross-validation was conducted, and the automatic scoring performance was evaluated using the quadratic weighted Cohen's kappa. The results showed that the kappa values of the automatic scoring performance were comparable to those of the inter-rater reliability of experts for all GRBAS items and the intra-rater reliability of experts for items G, B, A, and S. Random speed perturbation was the most effective data augmentation technique overall. When data augmentation was applied, power was the most effective for items G, R, A, and S; for item B, combining group delay and power yielded additional performance gains. The automatic GRBAS scoring achieved by the proposed model using scant labeled data was comparable to that of experts. This suggests that the challenges resulting from insufficient data can be alleviated. The findings of this study can also contribute to performance improvements in other tasks such as automatic voice disorder detection.
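Note: The summary names the quadratic weighted Cohen's kappa as the evaluation metric but does not define it. The following is a minimal sketch of the standard definition, not the authors' evaluation code. It assumes each GRBAS item is scored on the customary integer scale 0 to 3 (four levels); the function name, the `num_levels` parameter, and the example scores are hypothetical.

```python
import numpy as np


def quadratic_weighted_kappa(rater_a, rater_b, num_levels=4):
    """Quadratic weighted Cohen's kappa for two equal-length score sequences.

    Scores are assumed to be integers in 0 .. num_levels - 1
    (an assumption; 0 to 3 is the usual GRBAS range).
    """
    a = np.asarray(rater_a, dtype=int)
    b = np.asarray(rater_b, dtype=int)
    if a.ndim != 1 or a.shape != b.shape or a.size == 0:
        raise ValueError("inputs must be non-empty 1-D sequences of equal length")
    if a.min() < 0 or b.min() < 0 or a.max() >= num_levels or b.max() >= num_levels:
        raise ValueError("scores must lie in 0 .. num_levels - 1")

    # Observed counts: entry (i, j) is how often rater A gave i and rater B gave j.
    observed = np.zeros((num_levels, num_levels), dtype=float)
    np.add.at(observed, (a, b), 1.0)

    # Counts expected by chance if the two raters scored independently.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / a.size

    # Quadratic disagreement weights: 0 on the diagonal, 1 at maximal distance.
    levels = np.arange(num_levels)
    weights = (levels[:, None] - levels[None, :]) ** 2 / (num_levels - 1) ** 2

    chance_disagreement = (weights * expected).sum()
    if chance_disagreement == 0.0:
        # Both raters used one and the same level throughout; kappa is undefined.
        return float("nan")
    return 1.0 - (weights * observed).sum() / chance_disagreement


if __name__ == "__main__":
    # Hypothetical scores for one GRBAS item on five samples.
    expert_scores = [0, 1, 2, 3, 2]
    model_scores = [0, 1, 2, 2, 2]
    print(quadratic_weighted_kappa(expert_scores, model_scores))
```

A value of 1 indicates perfect agreement, 0 indicates agreement no better than chance, and negative values indicate systematic disagreement. Because the weights grow with the squared distance between scores, a disagreement of two levels is penalized four times as heavily as a disagreement of one level.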
ISSN: 0892-1997, 1873-4588
DOI: 10.1016/j.jvoice.2022.10.020