A Comprehensive Polish Medical Speech Dataset for Enhancing Automatic Medical Dictation

Bibliographic Details
Published in	Scientific Data Vol. 12; no. 1; pp. 1436 - 13
Main Authors	Czyżewski, Andrzej; Cygert, Sebastian; Marciniuk, Karolina; Szczodrak, Maciej; Harasimiuk, Arkadiusz; Odya, Piotr; Galanina, Marina; Szczuko, Piotr; Kostek, Bożena; Graff, Beata; Szplit, Dariusz; Budzisz, Mariusz; Narkiewicz, Krzysztof
Format	Journal Article
Language	English
Published	London: Nature Publishing Group UK; Nature Publishing Group; Nature Portfolio, 16.08.2025
Subjects	692/700/1538; 692/700/228; Automation; Data Descriptor; Datasets; Documentation; Human performance; Humanities and Social Sciences; Humans; Language; Linguistics; Medical Subject Headings-MeSH; multidisciplinary; Multilingualism; Natural language processing; Neural networks; Poland; Science; Science (multidisciplinary); Speech; Speech recognition; Speech Recognition Software; Voice recognition
Summary:	Pre-trained models have become widely adopted for their strong zero-shot performance, often minimizing the need for task-specific data. However, specialized domains like medical speech recognition still benefit from tailored datasets. We present ADMEDVOICE, a novel Polish medical speech dataset, collected using a high-quality text corpus and diverse recording conditions to reflect real-world scenarios. The dataset includes domain-specific vocabulary such as drug names and illnesses, with nearly 15 hours of audio from 28 speakers, including recordings made in noisy environments. Additionally, we release two enhanced versions: one anonymized for privacy-sensitive use and another synthetic version created via text-to-speech, totaling over 83 hours and nearly 50,000 samples. Evaluating the Whisper model, we observe a word error rate (WER) of 24.03 on our test set. Fine-tuning with human recordings reduces WER to 15.47, and incorporating anonymized and synthetic data further lowers it to 13.91. We open-source the dataset, fine-tuned model, and code on Kaggle to support continued research in medical speech recognition.
ISSN:	2052-4463
DOI:	10.1038/s41597-025-05776-1
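
The summary reports results as word error rate (WER). For readers unfamiliar with the metric, the following is a minimal sketch of the standard definition: the word-level edit distance (substitutions, deletions, insertions) between a reference transcript and a recognizer's output, divided by the number of reference words. It is not the authors' evaluation code. The function name, the whitespace tokenization, the percentage scale, and the example sentences are all assumptions made for illustration; the record does not say how text was normalized before scoring.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER as a percentage, using word-level Levenshtein distance.

    Assumption: words are separated by whitespace and compared exactly;
    real evaluations usually normalize case and punctuation first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion of a reference word
                curr[j - 1] + 1,                  # insertion of a hypothesis word
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr

    return 100.0 * prev[-1] / len(ref)


# Hypothetical dictation example (not taken from the dataset):
# one substituted drug name and one dropped word out of six reference words.
reference = "pacjent przyjmuje ibuprofen dwa razy dziennie"
hypothesis = "pacjent przyjmuje ibuprom dwa razy"
print(round(word_error_rate(reference, hypothesis), 2))
```

Because the denominator is the reference length, a misrecognized drug name counts the same as any other word, which is one reason a domain-specific test set can be more informative than a general-purpose one.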