Causal Forests for Discovering Diagnostic Language in Electronic Health Records

Bibliographic Details
Published in Applied Stochastic Models in Business and Industry Vol. 41; no. 5
Main Authors Albano, Alessandro; Di Maria, Chiara; Sciandra, Mariangela; Plaia, Antonella
Format Journal Article
Language English
Published 01.09.2025

Summary: Textual analysis has gained significant interest in medical research, particularly for automated patient diagnosis based on clinical narratives. While traditional approaches often focus on associational methods, this paper explores the application of causal forests to analyze textual data from electronic health records (EHRs), aiming to identify causal relationships between specific words and the likelihood of receiving certain medical diagnoses. Utilizing the MIMIC‐III dataset, we assess how linguistic factors influence diagnosis probabilities for three conditions: diabetes, hypothyroidism, and adrenal gland disorders. Our findings reveal significant causal links between certain clinical terms and diagnosis probabilities, emphasizing the potential of causal inference techniques to improve the analysis of language in clinical narratives. Additionally, we uncover heterogeneity in treatment effects, demonstrating that specific words can identify high‐risk patient subgroups. This study highlights the importance of integrating causal inference in natural language processing within healthcare settings.
ISSN: 1524-1904, 1526-4025
DOI: 10.1002/asmb.70038