Setting up of a machine learning algorithm for the identification of severe liver fibrosis profile in the general US population cohort
Published in | International journal of medical informatics (Shannon, Ireland) Vol. 170; p. 104932 |
---|---|
Main Authors | , , , , , |
Format | Journal Article |
Language | English |
Published | Ireland, Elsevier B.V, 01.02.2023 |
Summary: | • European and American guidelines suggest that liver biopsy should be reserved for patients at high risk of advanced liver disease. Noninvasive diagnostic methods for population-wide screening are therefore highly needed.
• Based on the NHANES dataset (2017–2020 pre-pandemic), which comprised 7265 eligible subjects, we set up a machine-learning-based algorithm that succeeded in classifying individuals with severe liver fibrosis (Fibroscan values ≥ 9.7 kPa), using 26 standard parameters.
• One relevant technical challenge of the study was dealing with a dataset highly imbalanced against the target class: only ∼ 5 % of the study cohort presented the target clinical condition (significant liver stiffness). To overcome the data imbalance, we applied the oversampling technique SMOTE-NC.
The progress of digital transformation in clinical practice opens the door to shifting the current clinical pathway for liver disease diagnosis from a late-stage approach to an early-stage one. Early diagnosis of liver fibrosis can prevent disease progression and decrease liver-related morbidity and mortality. Here we developed a machine learning (ML) algorithm, based on standard parameters, that can identify liver fibrosis in the general US population.
Starting from a public database (National Health and Nutrition Examination Survey, NHANES) representative of the American population, with 7265 eligible subjects (control population n = 6828, with Fibroscan values E < 9.7 kPa; target population n = 437, with Fibroscan values E ≥ 9.7 kPa), we set up an SVM algorithm able to discriminate individuals with liver fibrosis within the general US population. Setting up the algorithm involved removing missing data and a sampling optimization step to manage the data imbalance (only ∼ 5 % of the dataset is the target population).
For feature selection, we performed an unbiased analysis, starting from 33 clinical, anthropometric, and biochemical parameters regardless of their previous application as biomarkers of liver disease. Through PCA, we identified the 26 most significant features and then used them to optimize the sampling method for the SVM algorithm. The best technique for managing the data imbalance was found to be oversampling with SMOTE-NC. For final model validation, we used a subset of 300 individuals (150 with liver fibrosis and 150 controls), held out from the main dataset prior to sampling. Performance was evaluated over multiple independent runs.
We provide proof of concept of an ML clinical decision support tool for liver fibrosis diagnosis in the general US population. Although the presented ML model is at this stage only a prototype, in the future it might be implemented and potentially applied in broad screening programs for liver fibrosis. |
---|---|
ISSN: | 1386-5056, 1872-8243 |
DOI: | 10.1016/j.ijmedinf.2022.104932 |
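
The hold-out and oversampling steps described in the summary could be sketched roughly as follows. This is only an illustration under stated assumptions, not the authors' implementation: the feature values are synthetic toy data, the function names `holdout_balanced` and `smote_nc_like` are hypothetical, k = 5 neighbours is an assumed setting, and the oversampler is a simplified pure-Python take on the SMOTE-NC idea (interpolate numeric features between a minority sample and one of its nearest neighbours, take categorical features by majority vote among those neighbours). Only the cohort sizes (6828 controls, 437 targets) and the 150 + 150 validation subset come from the record.

```python
import random
from collections import Counter
from statistics import median, pstdev


def holdout_balanced(labels, n_per_class, rng):
    """Draw n_per_class indices from each class; return (validation, training)."""
    val = []
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        val.extend(rng.sample(idx, n_per_class))
    val_set = set(val)
    train = [i for i in range(len(labels)) if i not in val_set]
    return sorted(val), train


def smote_nc_like(minority, cat_cols, n_new, k, rng):
    """Create n_new synthetic minority rows (simplified SMOTE-NC idea)."""
    num_cols = [c for c in range(len(minority[0])) if c not in cat_cols]
    # A categorical mismatch costs the median standard deviation
    # of the numeric columns, computed on the minority class.
    penalty = median(pstdev([row[c] for row in minority]) for c in num_cols)

    def dist(a, b):
        d = sum((a[c] - b[c]) ** 2 for c in num_cols)
        d += sum(penalty ** 2 for c in cat_cols if a[c] != b[c])
        return d ** 0.5

    # Precompute the k nearest minority neighbours of every minority row.
    neighbours = []
    for i, base in enumerate(minority):
        others = [r for j, r in enumerate(minority) if j != i]
        others.sort(key=lambda r: dist(base, r))
        neighbours.append(others[:k])

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base, nn = minority[i], neighbours[i]
        mate = rng.choice(nn)
        gap = rng.random()
        row = list(base)
        for c in num_cols:  # interpolate between base and neighbour
            row[c] = base[c] + gap * (mate[c] - base[c])
        for c in cat_cols:  # majority vote among the k neighbours
            row[c] = Counter(r[c] for r in nn).most_common(1)[0][0]
        synthetic.append(row)
    return synthetic


rng = random.Random(0)
labels = [0] * 6828 + [1] * 437  # cohort sizes quoted in the summary
# Toy features (assumption): two numeric columns and one categorical column.
rows = [[rng.gauss(50 + 10 * y, 15), rng.gauss(28 + 4 * y, 5), rng.choice("MF")]
        for y in labels]

# 1) Set aside the balanced validation subset BEFORE any oversampling.
val_idx, train_idx = holdout_balanced(labels, 150, rng)

# 2) Oversample only the minority class of the remaining training data.
train_minority = [rows[i] for i in train_idx if labels[i] == 1]
n_majority = sum(1 for i in train_idx if labels[i] == 0)
synthetic = smote_nc_like(train_minority, cat_cols=[2],
                          n_new=n_majority - len(train_minority), k=5, rng=rng)

print(len(val_idx), len(train_minority), n_majority, len(synthetic))
```

Splitting first matters because synthetic rows are built from real minority rows: if the validation subjects were still in the pool when oversampling ran, interpolated copies of them could leak into training and inflate the measured performance. In practice a maintained implementation such as the `SMOTENC` class in the imbalanced-learn package would normally be preferred over a hand-written sketch like this one.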