Causal Forests for Discovering Diagnostic Language in Electronic Health Records
Published in | Applied stochastic models in business and industry Vol. 41; no. 5 |
---|---|
Main Authors | Albano, Alessandro; Di Maria, Chiara; Sciandra, Mariangela; Plaia, Antonella |
Format | Journal Article |
Language | English |
Published | 01.09.2025 |
Abstract | Textual analysis has gained significant interest in medical research, particularly for automated patient diagnosis based on clinical narratives. While traditional approaches often focus on associational methods, this paper explores the application of causal forests to analyze textual data from electronic health records (EHRs), aiming to identify causal relationships between specific words and the likelihood of receiving certain medical diagnoses. Utilizing the MIMIC‐III dataset, we assess how linguistic factors influence diagnosis probabilities for three conditions: diabetes, hypothyroidism, and adrenal gland disorders. Our findings reveal significant causal links between certain clinical terms and diagnosis probabilities, emphasizing the potential of causal inference techniques to improve the analysis of language in clinical narratives. Additionally, we uncover heterogeneity in treatment effects, demonstrating that specific words can identify high‐risk patient subgroups. This study highlights the importance of integrating causal inference in natural language processing within healthcare settings. |
---|---|
Author | Albano, Alessandro; Sciandra, Mariangela; Plaia, Antonella; Di Maria, Chiara |
Author_xml | 1. Albano, Alessandro (ORCID: 0000-0002-4259-0710), Department of Economics, Business, and Statistics, University of Palermo, Palermo, Italy; 2. Di Maria, Chiara, Department of Economics, Business, and Statistics, University of Palermo, Palermo, Italy; 3. Sciandra, Mariangela, Department of Economics, Business, and Statistics, University of Palermo, Palermo, Italy; 4. Plaia, Antonella, Department of Economics, Business, and Statistics, University of Palermo, Palermo, Italy |
ContentType | Journal Article |
DOI | 10.1002/asmb.70038 |
Discipline | Mathematics |
EISSN | 1526-4025 |
ISSN | 1524-1904 |
IsDoiOpenAccess | false |
IsOpenAccess | true |
IsPeerReviewed | true |
IsScholarly | true |
Issue | 5 |
Language | English |
ORCID | 0000-0002-4259-0710 |
OpenAccessLink | https://onlinelibrary.wiley.com/doi/pdfdirect/10.1002/asmb.70038 |
PublicationCentury | 2000 |
PublicationDate | 2025-09-00 |
PublicationDateYYYYMMDD | 2025-09-01 |
PublicationDate_xml | – month: 09 year: 2025 text: 2025-09-00 |
PublicationDecade | 2020 |
PublicationTitle | Applied stochastic models in business and industry |
PublicationYear | 2025 |
Title | Causal Forests for Discovering Diagnostic Language in Electronic Health Records |
Volume | 41 |