Evaluating dimensionality reduction of comorbidities for predictive modeling in individuals with neurofibromatosis type 1
Published in | JAMIA open Vol. 8; no. 1; p. ooae157 |
---|---|
Main Authors | , , , , , , , |
Format | Journal Article |
Language | English |
Published | United States: Oxford University Press, 01.02.2025 |
Subjects | |
Summary: | Objective
Dimensionality reduction techniques aim to enhance the performance of machine learning (ML) models by reducing noise and mitigating overfitting. We sought to compare the effect of different dimensionality reduction methods for comorbidity features extracted from electronic health records (EHRs) on the performance of ML models for predicting the development of various sub-phenotypes in children with Neurofibromatosis type 1 (NF1).
Materials and Methods
EHR-derived data from pediatric subjects with a confirmed clinical diagnosis of NF1 were used to create 10 unique feature sets derived from comorbidity codes. The feature sets were built by applying dimensionality reduction techniques to raw International Classification of Diseases codes and to the Clinical Classifications Software Refined and Phecode mapping schemes. We compared the performance of logistic regression, XGBoost, and random forest models utilizing each feature set.
Results
XGBoost-based predictive models were most successful at predicting NF1 sub-phenotypes. Overall, features based on domain knowledge-informed mapping schema performed better than unsupervised feature reduction methods. High-level features exhibited the worst performance across models and outcomes, suggesting excessive information loss with over-aggregation of features.
Discussion
Model performance is significantly impacted by dimensionality reduction techniques and varies by specific ML algorithm and outcome being predicted. Automated methods using existing knowledge and ontology databases can effectively aggregate features extracted from EHRs.
Conclusion
Dimensionality reduction through feature aggregation can enhance the performance of ML models, particularly in high-dimensional datasets with small sample sizes, which are common in EHR-based health applications. However, if not carefully optimized, it can lead to information loss and data oversimplification, potentially degrading model performance.
Lay Summary
Dimensionality reduction, a technique used to simplify data by reducing noise and overfitting, plays a key role in enhancing the performance of machine learning (ML) models. This study assessed various dimensionality reduction methods applied to comorbidity features extracted from the electronic health records (EHRs) of children with Neurofibromatosis type 1 (NF1). Due to extreme heterogeneity in the clinical sub-phenotypes arising in people with NF1, it is difficult to predict who will develop one or more of the many NF1-associated clinical sub-phenotypes. Using the reduced feature sets derived from diagnostic codes, 3 ML models were employed to predict NF1 sub-phenotypes such as optic pathway glioma, attention-deficit hyperactivity disorder, and scoliosis. The study demonstrated that model performance is significantly impacted by the choice of dimensionality reduction technique and varies depending on the specific ML algorithm and the predicted outcome. Automated methods utilizing existing knowledge and ontology databases can effectively aggregate features derived from EHRs. Feature aggregation through dimensionality reduction can significantly boost ML model performance, particularly in high-dimensional datasets with small sample sizes, which are common in EHR-based health applications. However, if not carefully optimized, dimensionality reduction can lead to information loss and data oversimplification, potentially negatively affecting model performance. |
---|---|
Bibliography: | A. Gupta and E. Hillis contributed equally and are considered co-first authors of this work. |
ISSN: | 2574-2531 |
DOI: | 10.1093/jamiaopen/ooae157 |
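
The feature aggregation described in the summary can be illustrated with a minimal sketch. It is not the study's pipeline: the patient records are hypothetical toy data, and truncating each ICD-10 code to its 3-character category stands in for the Clinical Classifications Software Refined and Phecode mappings, which require external lookup tables. The helper names `icd10_category` and `binary_matrix` are likewise assumptions made for illustration.

```python
# Hypothetical toy data (not from the study): patient id -> set of ICD-10-CM codes.
patients = {
    "p1": {"Q85.01", "F90.0", "M41.124"},
    "p2": {"Q85.01", "F90.2", "H47.031"},
    "p3": {"Q85.01", "M41.125", "F90.0"},
}


def icd10_category(code: str) -> str:
    """Truncate an ICD-10 code to its 3-character category, e.g. F90.0 -> F90.

    This is a simple stand-in for a knowledge-based mapping scheme.
    """
    return code.split(".")[0][:3]


def binary_matrix(records, mapper=lambda code: code):
    """Map every code with `mapper`, then build a patient x feature 0/1 matrix."""
    mapped = {pid: {mapper(c) for c in codes} for pid, codes in records.items()}
    features = sorted(set().union(*mapped.values()))
    rows = {
        pid: [int(f in feats) for f in features]
        for pid, feats in mapped.items()
    }
    return features, rows


# One column per raw code versus one column per aggregated category.
raw_features, raw_rows = binary_matrix(patients)
agg_features, agg_rows = binary_matrix(patients, icd10_category)

print(len(raw_features), "raw features:", raw_features)
print(len(agg_features), "aggregated features:", agg_features)
```

Aggregation merges codes that differ only below the category level, so the matrix becomes narrower and less sparse. Mapping to an even coarser level would shrink it further, at the risk of the information loss the summary attributes to high-level features.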