Evaluating dimensionality reduction of comorbidities for predictive modeling in individuals with neurofibromatosis type 1

Bibliographic Details
Published in	JAMIA Open, Vol. 8, no. 1, p. ooae157
Main Authors	Gupta, Aditi; Hillis, Ethan; Oh, Inez Y; Morris, Stephanie M; Abrams, Zach; Foraker, Randi E; Gutmann, David H; Payne, Philip R O
Format	Journal Article
Language	English
Published	United States: Oxford University Press, 01.02.2025
Subjects	Algorithms; Attention-deficit hyperactivity disorder; Comorbidity; Electronic health records; Genetic disorders; Healthcare industry software; Machine learning; Medical colleges; Medical records; Medical research; Medicine, Experimental; Neurofibromatosis; Noise control; Research and Applications; Missouri; neurofibromatosis type 1; electronic health records; clinical research informatics; predictive modeling

More Information
Summary:	Objective: Dimensionality reduction techniques aim to enhance the performance of machine learning (ML) models by reducing noise and mitigating overfitting. We sought to compare the effect of different dimensionality reduction methods for comorbidity features extracted from electronic health records (EHRs) on the performance of ML models for predicting the development of various sub-phenotypes in children with Neurofibromatosis type 1 (NF1).
Materials and Methods: EHR-derived data from pediatric subjects with a confirmed clinical diagnosis of NF1 were used to create 10 unique feature sets derived from comorbidity codes by applying dimensionality reduction techniques to raw International Classification of Diseases codes and to the Clinical Classifications Software Refined and Phecode mapping schemes. We compared the performance of logistic regression, XGBoost, and random forest models utilizing each feature set.
Results: XGBoost-based predictive models were most successful at predicting NF1 sub-phenotypes. Overall, features based on domain knowledge-informed mapping schemes performed better than unsupervised feature reduction methods. High-level features exhibited the worst performance across models and outcomes, suggesting excessive information loss with over-aggregation of features.
Discussion: Model performance is significantly impacted by dimensionality reduction techniques and varies by specific ML algorithm and outcome being predicted. Automated methods using existing knowledge and ontology databases can effectively aggregate features extracted from EHRs.
Conclusion: Dimensionality reduction through feature aggregation can enhance the performance of ML models, particularly in high-dimensional datasets with small sample sizes, which are commonly found in EHR-based health applications. However, if not carefully optimized, it can lead to information loss and data oversimplification, potentially adversely affecting model performance.
Lay Summary: Dimensionality reduction, a technique used to simplify data by reducing noise and overfitting, plays a key role in enhancing the performance of machine learning (ML) models. This study assessed various dimensionality reduction methods applied to comorbidity features extracted from the electronic health records (EHRs) of children with Neurofibromatosis type 1 (NF1). Because the clinical sub-phenotypes arising in people with NF1 are extremely heterogeneous, it is difficult to predict who will develop one or more of the many NF1-associated clinical sub-phenotypes. Using the reduced feature sets derived from diagnostic codes, 3 ML models were employed to predict NF1 sub-phenotypes such as optic pathway glioma, attention-deficit hyperactivity disorder, and scoliosis. The study demonstrated that model performance is significantly impacted by the choice of dimensionality reduction technique and varies depending on the specific ML algorithm and the predicted outcome. Automated methods utilizing existing knowledge and ontology databases can effectively aggregate features derived from EHRs. Feature aggregation through dimensionality reduction can significantly boost ML model performance, particularly in high-dimensional datasets with small sample sizes, which are common in EHR-based health applications. However, if not carefully optimized, dimensionality reduction can lead to information loss and data oversimplification, potentially negatively affecting model performance.
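Illustration: the feature aggregation described in the summary can be pictured as collapsing each patient's raw diagnosis codes into a smaller set of binary group indicators. The sketch below is a minimal example under stated assumptions and is not the authors' pipeline. The lookup table `CODE_TO_GROUP`, the helper `aggregate_codes`, the group names, and the patient records are all hypothetical; the groupings are not taken from the actual Clinical Classifications Software Refined or Phecode releases.

```python
# Hypothetical code-to-group lookup, for illustration only. It does not
# reproduce real CCSR or Phecode assignments, which should be loaded from
# their official mapping files.
CODE_TO_GROUP = {
    "F90.0": "adhd",
    "F90.2": "adhd",
    "M41.12": "scoliosis",
    "M41.9": "scoliosis",
    "H47.03": "optic_nerve_disorder",
}


def aggregate_codes(patient_codes, code_to_group, unmapped="other"):
    """Collapse each patient's raw codes into binary group indicators.

    Codes missing from the lookup fall into a single `unmapped` group.
    Returns the ordered group names and one 0/1 row per patient.
    """
    groups = sorted(set(code_to_group.values()) | {unmapped})
    rows = {}
    for patient_id, codes in patient_codes.items():
        present = {code_to_group.get(code, unmapped) for code in codes}
        rows[patient_id] = [1 if group in present else 0 for group in groups]
    return groups, rows


# Synthetic patients (not real data).
patients = {
    "p1": ["F90.0", "F90.2"],
    "p2": ["M41.12", "H47.03"],
    "p3": ["Z00.129"],
}

columns, features = aggregate_codes(patients, CODE_TO_GROUP)

# Number of distinct raw codes observed, i.e. the width of a one-hot
# encoding built directly on raw codes.
raw_dim = len({code for codes in patients.values() for code in codes})

print(columns)
print(features)
print(f"raw codes: {raw_dim}, aggregated features: {len(columns)}")
```

A coarser lookup (fewer, broader groups) would shrink the feature set further, which is the setting in which the summary reports information loss from over-aggregation.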
Bibliography:	A. Gupta and E. Hillis contributed equally and are considered co-first authors of this work.
ISSN:	2574-2531
DOI:	10.1093/jamiaopen/ooae157