VocDoc, what happened to my voice? Towards automatically capturing vocal fatigue in the wild

Bibliographic Details
Published in	Biomedical signal processing and control Vol. 88; p. 105595
Main Authors	Pokorny, Florian B., Linke, Julian, Seddiki, Nico, Lohrmann, Simon, Gerstenberger, Claus, Haspl, Katja, Feiner, Marlies, Eyben, Florian, Hagmüller, Martin, Schuppler, Barbara, Kubin, Gernot, Gugatschka, Markus
Format	Journal Article
Language	English
Published	Elsevier Ltd 01.02.2024
Subjects	Digital health; Machine learning; Mobile application; Speech-language pathology; Vocal fatigue; Voice assessment; Voice features
Summary:	Voice problems that arise during everyday vocal use can hardly be captured by standard outpatient voice assessments. In preparation for VocDoc, a digital health application to automatically assess longitudinal voice data ‘in the wild’, the aim of this paper was to study vocal fatigue from the speaker’s perspective, the healthcare professional’s perspective, and the ‘machine’s’ perspective. We collected data from four vocally healthy speakers completing a 90-min reading task. Every 10 min, the speakers were asked about subjective voice characteristics. We then addressed the task of elapsed speaking time recognition: we carried out listening experiments with speech and language therapists and trained random forests on extracted acoustic features. We validated our models speaker-dependently and speaker-independently and analysed the underlying feature importances. For an additional, clinical application-oriented scenario, we extended our dataset with lecture recordings of another two speakers. Self- and expert assessments were not consistent. With mean F1 scores up to 0.78, automatic elapsed speaking time recognition worked reliably in the speaker-dependent scenario only. A small set of acoustic features, different from the features previously reported to reflect vocal fatigue, was found to universally describe long-term variations of the voice. Vocal fatigue seems to have individual effects across different speakers. Machine learning has the potential to automatically detect and characterise vocal changes over time. Our study provides technical underpinnings for a future mobile solution to objectively capture pathological long-term voice variations in everyday life settings and make them clinically accessible.
• A few acoustic features seem to universally describe vocal fatigue.
• Vocal fatigue has rather individual effects across different speakers.
• Machine learning has the potential to automatically detect effects of vocal fatigue.
• A mobile app can capture clinically relevant long-term voice variations in the wild.
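
The two validation schemes and the mean F1 score mentioned in the summary can be illustrated with a minimal Python sketch. It is not the authors' code. The function names, the data layout (one speaker label per sample), the round-robin folds, and the reading of "mean F1" as the unweighted mean of per-class F1 scores are all assumptions made for illustration. The random forest classifier itself is omitted; only the data splitting and the scoring are shown.

```python
from collections import defaultdict


def leave_one_speaker_out(speakers):
    """Speaker-independent splits: test on one speaker, train on all others.

    `speakers` holds one speaker label per sample (assumed layout).
    Yields (held_out_speaker, train_indices, test_indices).
    """
    for held_out in sorted(set(speakers)):
        train = [i for i, s in enumerate(speakers) if s != held_out]
        test = [i for i, s in enumerate(speakers) if s == held_out]
        yield held_out, train, test


def within_speaker_folds(speakers, n_folds=2):
    """Speaker-dependent splits: train and test on the same speaker.

    Each speaker's samples are dealt round-robin into `n_folds` folds
    (a hypothetical fold scheme, chosen only for simplicity).
    Yields (speaker, train_indices, test_indices).
    """
    by_speaker = defaultdict(list)
    for i, s in enumerate(speakers):
        by_speaker[s].append(i)
    for speaker in sorted(by_speaker):
        idx = by_speaker[speaker]
        for k in range(n_folds):
            test = idx[k::n_folds]
            train = [i for i in idx if i not in test]
            yield speaker, train, test


def mean_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed meaning of 'mean F1')."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

In such a setup, any classifier would be fitted on the training indices of each split and scored with `mean_f1` on the test indices. Comparing the scores from the two split functions would show whether a model generalises across speakers or only within one speaker.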
ISSN:	1746-8094
DOI:	10.1016/j.bspc.2023.105595