A Comprehensive Polish Medical Speech Dataset for Enhancing Automatic Medical Dictation
Pre-trained models have become widely adopted for their strong zero-shot performance, often minimizing the need for task-specific data. However, specialized domains like medical speech recognition still benefit from tailored datasets. We present ADMEDVOICE, a novel Polish medical speech dataset, col...
Saved in:
Published in | Scientific data Vol. 12; no. 1; pp. 1436 - 13 |
---|---|
Main Authors | , , , , , , , , , , , , |
Format | Journal Article |
Language | English |
Published |
London
Nature Publishing Group UK
16.08.2025
Nature Publishing Group Nature Portfolio |
Subjects | |
Online Access | Get full text |
Cover
Loading…
Summary: | Pre-trained models have become widely adopted for their strong zero-shot performance, often minimizing the need for task-specific data. However, specialized domains like medical speech recognition still benefit from tailored datasets. We present ADMEDVOICE, a novel Polish medical speech dataset, collected using a high-quality text corpus and diverse recording conditions to reflect real-world scenarios. The dataset includes domain-specific vocabulary such as drug names and illnesses, with nearly 15 hours of audio from 28 speakers, including noisy environments. Additionally, we release two enhanced versions: one anonymized for privacy-sensitive use and another synthetic version created via text-to-speech, totaling over 83 hours and nearly 50,000 samples. Evaluating the Whisper model, we observe a 24.03 WER on our test set. Fine-tuning with human recordings reduces WER to 15.47, and incorporating anonymized and synthetic data further lowers it to 13.91. We open-source the dataset, fine-tuned model, and code on Kaggle to support continued research in medical speech recognition. |
---|---|
Bibliography: | ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 14 content type line 23 |
ISSN: | 2052-4463 2052-4463 |
DOI: | 10.1038/s41597-025-05776-1 |