Causal Forests for Discovering Diagnostic Language in Electronic Health Records

Bibliographic Details
Published in Applied Stochastic Models in Business and Industry Vol. 41; no. 5
Main Authors Albano, Alessandro; Di Maria, Chiara; Sciandra, Mariangela; Plaia, Antonella
Format Journal Article
Language English
Published 01.09.2025

Abstract Textual analysis has gained significant interest in medical research, particularly for automated patient diagnosis based on clinical narratives. While traditional approaches often focus on associational methods, this paper explores the application of causal forests to analyze textual data from electronic health records (EHRs), aiming to identify causal relationships between specific words and the likelihood of receiving certain medical diagnoses. Utilizing the MIMIC‐III dataset, we assess how linguistic factors influence diagnosis probabilities for three conditions: diabetes, hypothyroidism, and adrenal gland disorders. Our findings reveal significant causal links between certain clinical terms and diagnosis probabilities, emphasizing the potential of causal inference techniques to improve the analysis of language in clinical narratives. Additionally, we uncover heterogeneity in treatment effects, demonstrating that specific words can identify high‐risk patient subgroups. This study highlights the importance of integrating causal inference in natural language processing within healthcare settings.
Author Albano, Alessandro (ORCID 0000-0002-4259-0710); Department of Economics, Business, and Statistics, University of Palermo, Palermo, Italy
Author Di Maria, Chiara; Department of Economics, Business, and Statistics, University of Palermo, Palermo, Italy
Author Sciandra, Mariangela; Department of Economics, Business, and Statistics, University of Palermo, Palermo, Italy
Author Plaia, Antonella; Department of Economics, Business, and Statistics, University of Palermo, Palermo, Italy
DOI 10.1002/asmb.70038
Discipline Mathematics
EISSN 1526-4025
ISSN 1524-1904
IsDoiOpenAccess false
IsOpenAccess true
IsPeerReviewed true
IsScholarly true
OpenAccessLink https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/asmb.70038