Toward Cross-Hospital Deployment of Natural Language Processing Systems: Model Development and Validation of Fine-Tuned Large Language Models for Disease Name Recognition in Japanese

Bibliographic Details
Published in	JMIR medical informatics Vol. 13; p. e76773
Main Authors	Shimizu, Seiji; Nishiyama, Tomohiro; Nagai, Hiroyuki; Wakamiya, Shoko; Aramaki, Eiji
Format	Journal Article
Language	English
Published	Canada JMIR Publications 08.07.2025
Subjects	Abscesses; Ambient AI Scribes and AI-Driven Documentation Technologies; Anemia; Annotations; Appendicitis; Artificial Intelligence; Case reports; Chronic obstructive pulmonary disease; Clinical Informatics; Colorectal cancer; Decision Support for Health Professionals; Disease - classification; Documentation; Documents; Electronic Health Records; Gastric cancer; Hemorrhage; Humans; Japan; Japanese language; Language; Large Language Models; Liver; Lung cancer; Metastasis; Natural Language Processing; Original Paper; Pneumonia; clinical corpus; clinical NLP; clinical natural language processing; named entity recognition; out-of-domain robustness
Summary:	Disease name recognition is a fundamental task in clinical natural language processing, enabling the extraction of critical patient information from electronic health records. While recent advances in large language models (LLMs) have shown promise, most evaluations have focused on English, and little is known about their robustness in low-resource languages such as Japanese. In particular, whether these models can perform reliably on previously unseen in-hospital data, which differ from training data in writing style and clinical context, has not been thoroughly investigated. This study evaluated the robustness of fine-tuned LLMs for disease name recognition in Japanese clinical notes, with a particular focus on their performance on in-hospital data that were not included in training. We used two corpora: (1) a publicly available set of Japanese case reports, denoted CR, and (2) a newly constructed corpus of progress notes, denoted PN, written by ten physicians to capture stylistic variation in in-hospital clinical notes. To reflect real-world deployment scenarios, we first fine-tuned models on CR, comparing an LLM with a baseline masked language model (MLM). These models were then evaluated under two conditions: (1) on CR, representing the in-domain (ID) setting, in which the document type matches the training data, and (2) on PN, representing the out-of-domain (OOD) setting, in which the document type differs. Robustness was assessed by calculating the performance gap (i.e., the performance drop from the ID to the OOD setting). The LLM demonstrated greater robustness, with a smaller performance gap in F1-score (ID-OOD = -8.6) than the MLM baseline (ID-OOD = -13.9). This indicates more stable performance across ID and OOD settings, highlighting the effectiveness of fine-tuned LLMs for reliable use in diverse clinical settings.
Fine-tuned LLMs demonstrate superior robustness for disease name recognition in Japanese clinical notes, with a smaller performance gap. These findings highlight the potential of LLMs as reliable tools for clinical natural language processing in low-resource language settings and support their deployment in real-world health care applications, where diversity in documentation is inevitable.
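The robustness measure described in the summary can be illustrated with a minimal Python sketch. It is not the authors' evaluation code. It assumes entity-level F1 with exact span matching and a signed gap computed as the OOD score minus the ID score, so that a negative value denotes a drop; the summary states neither the matching criterion nor the exact formula. The function names `entity_f1` and `performance_gap` and all sample values are hypothetical.

```python
def entity_f1(gold_spans, pred_spans):
    """Entity-level F1 on a 0-100 scale, assuming exact span matching.

    Each span is a (start, end) tuple of character offsets. The matching
    criterion is an assumption; the summary does not specify one.
    """
    gold, pred = set(gold_spans), set(pred_spans)
    true_positives = len(gold & pred)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 100.0 * 2 * precision * recall / (precision + recall)


def performance_gap(id_f1, ood_f1):
    """Signed gap between settings: negative means OOD is worse than ID.

    The sign convention (OOD minus ID) is an assumption chosen so that a
    performance drop is reported as a negative number.
    """
    return ood_f1 - id_f1


# Hypothetical toy spans, not data from the study.
gold = [(0, 2), (5, 8)]
pred = [(0, 2), (6, 8)]
toy_f1 = entity_f1(gold, pred)

# Hypothetical scores for two models, not results from the study.
gap_a = performance_gap(id_f1=80.0, ood_f1=70.0)
gap_b = performance_gap(id_f1=80.0, ood_f1=75.0)

# Under this convention, the model whose gap is closer to zero is more robust.
more_robust = "B" if abs(gap_b) < abs(gap_a) else "A"
```

Comparing gaps rather than raw OOD scores separates robustness from overall accuracy: a model can score lower in both settings and still lose less when the document type changes.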
ISSN:	2291-9694
DOI:	10.2196/76773