A Machine Learning Algorithm With an Oversampling Technique in Limited Data Scenarios for the Prediction of Present and Future Restorative Treatment Need: Development and Validation Study


Bibliographic Details
Published in JMIR medical informatics Vol. 13; p. e75117
Main Authors Väyrynen, Elina, Tirkkonen, Otso, Tiensuu, Henna, Suutala, Jaakko, Anttonen, Vuokko, Laitala, Marja-Liisa, Kukkola, Katri, Karki, Saujanya
Format Journal Article
Language English
Published Canada 28.08.2025
Summary: Untreated dental caries is the most common health condition worldwide. Therefore, new strategies need to be developed to reduce the manifestations of dental caries. This study aimed to develop and test a machine learning (ML) algorithm for detecting present and predicting future carious lesions in the adolescent population using a set of easy-to-collect predictive variables. In addition, this study aimed to deal with an imbalanced and small dataset using an oversampling method. This population-based study was conducted among secondary schoolchildren, aged between 13 and 17 years, from the northern parts of Finland in 2022. A total of 218 participants who met the inclusion criteria were included in this study. The inclusion criteria were completion of a web-based risk assessment questionnaire and a clinical examination at public health care services. Dental caries (International Caries Detection and Assessment System [ICDAS] scores of 4, 5, and 6; ie, ICDAS 4-6) and active initial caries (ICDAS 2+, 3+) were considered as outcomes. Several predictors, such as behavioral and dietary habits, were included. An extreme gradient boosting model was developed, tested, and assessed for its predictive performance. A 4-fold cross-validation was performed using the nested resampling technique. The random oversampling examples method and the k-nearest neighbors classifiers were used for all 4 folds. The mean (SD) performance of all the folds was computed. Dental caries (ICDAS 2+, 3+, 4-6) was prevalent in 65.6% (143/218) of the participants. The mean area under the curve was 0.77 (SD 0.04) and the mean F1-score was 0.82 (SD 0.06) for the extreme gradient boosting model. Similarly, the mean area under the curve and mean F1-score after oversampling were 0.74 (SD 0.05) and 0.79 (SD 0.04), respectively.
The Shapley additive explanation values were calculated for all 4 folds to assess feature importance, revealing that previous dental fillings were the feature most strongly associated with the need for restorative treatment. On the basis of the performance metrics, the ML algorithm developed and tested in this study can be considered good. The ML algorithm could serve as a cost-effective screening tool for dental professionals to identify the risk of future restorative treatment needs. However, future studies with longitudinal cohort data, along with external validation to assess generalizability, are needed to confirm the model.
ISSN:2291-9694
DOI:10.2196/75117
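
The resampling workflow described in the summary (4-fold cross-validation, oversampling of the training data, and mean (SD) performance across folds) can be illustrated with a minimal Python sketch. It is not the authors' code. Several things here are assumptions made for illustration: the function names are hypothetical, the labels are synthetic and only reproduce the reported prevalence (143 of 218), plain random duplication of minority cases stands in for the random oversampling examples method named in the summary, the inner tuning loop of nested resampling is omitted, and a placeholder that always predicts the positive class stands in for the extreme gradient boosting model.

```python
import random
import statistics


def stratified_folds(labels, k, seed=0):
    """Split indices into k folds with a similar class ratio in each fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def random_oversample(train_idx, labels, seed=0):
    """Duplicate randomly drawn minority-class indices until classes are equal in size.

    Assumption: simple duplication, used here as a stand-in for the
    oversampling method named in the summary.
    """
    rng = random.Random(seed)
    by_class = {}
    for i in train_idx:
        by_class.setdefault(labels[i], []).append(i)
    target = max(len(v) for v in by_class.values())
    balanced = []
    for idx in by_class.values():
        balanced.extend(idx)
        balanced.extend(rng.choices(idx, k=target - len(idx)))
    return balanced


def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN) for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Synthetic labels matching the reported prevalence: 143 of 218 positive.
labels = [1] * 143 + [0] * 75
folds = stratified_folds(labels, k=4)

fold_scores = []
for held_out in range(4):
    test_idx = folds[held_out]
    train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
    # Oversample the training portion only, so the held-out fold keeps
    # its original class balance and no duplicated case leaks into it.
    balanced_idx = random_oversample(train_idx, labels)
    # Placeholder model (assumption): always predicts the positive class.
    # A real classifier would be fitted on balanced_idx at this point.
    y_true = [labels[i] for i in test_idx]
    y_pred = [1] * len(test_idx)
    fold_scores.append(f1_score(y_true, y_pred))

mean_f1 = statistics.mean(fold_scores)
sd_f1 = statistics.stdev(fold_scores)
print(f"Mean F1 across folds: {mean_f1:.2f} (SD {sd_f1:.2f})")
```

Oversampling inside the loop, after the split, is the design choice worth noting: balancing the full dataset before splitting would place copies of the same participant in both the training and held-out folds and inflate the reported scores.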