Breast cancer prediction with feature-selected XGB classifier, optimized by metaheuristic algorithms

Bibliographic Details
Published in Journal of big data Vol. 12; no. 1; pp. 78 - 26
Main Authors Sarker, Proshenjit, Ksibi, Amel, Jamjoom, Mona M., Choi, Kwonhue, Nahid, Abdullah Al, Samad, Md Abdus
Format Journal Article
Language English
Published Cham Springer International Publishing 01.04.2025
Springer Nature B.V
SpringerOpen
Summary: Breast cancer, caused by uncontrolled cell growth in milk ducts or lobules, has the highest mortality rate among women worldwide, with Asia reporting the most deaths. Early detection improves survival rates and reduces treatment costs. This study aims to develop a feature selection-based classifier to enhance breast cancer prediction, using a minimal dataset to maximize performance. Earlier works on the Breast Cancer Coimbra dataset used many features but failed to achieve high accuracy or explain misclassifications. We addressed this by reducing the number of features while maintaining performance. We built a wrapper model that pairs three metaheuristic algorithms (Whale Optimization, Bald Eagle Search, and Sea Lion Optimization) with an Extreme Gradient Boosting (XGB) classifier. SHAP explained feature importance for both overall and individual predictions. Our study achieved F-scores of 97.43%, 95%, and 94.74% for SLOA_XGB, BESA_XGB, and WOA_XGB, respectively, on the Breast Cancer Coimbra dataset. Each method reduced the features from 9 to 4. SHAP analysis identified Glucose as having the highest impact on model predictions. Additionally, we found a link between the mean values of certain features and misclassification likelihood. This study analyzed data from 116 subjects, with the SLOA_XGB classifier achieving the best performance: 97.43% F-score, 97.14% accuracy, 97.14% precision, and 100% recall using only Glucose, Age, Resistin, and Adiponectin. These results highlight the potential for early breast cancer detection with fewer features while maintaining high predictive accuracy.
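The wrapper idea in the summary (search over feature subsets, score each subset by the F-score of a classifier trained on it) can be illustrated with a minimal sketch. This is not the authors' implementation, and several parts are assumptions made to keep it self-contained: plain random search stands in for the Sea Lion, Bald Eagle, and Whale optimizers, a nearest-centroid rule stands in for the XGB classifier, and synthetic data with 9 features stands in for the Breast Cancer Coimbra dataset. All function names (`make_data`, `centroid_predict`, `fitness`, `wrapper_search`) are hypothetical.

```python
import random


def f_score(y_true, y_pred):
    """F1 score for binary labels, with 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def make_data(n_rows, n_features, rng):
    """Synthetic stand-in data: only features 0 and 3 carry class signal."""
    X, y = [], []
    for _ in range(n_rows):
        label = rng.randint(0, 1)
        row = [rng.gauss(0.0, 1.0) for _ in range(n_features)]
        row[0] += 2.0 * label
        row[3] += 2.0 * label
        X.append(row)
        y.append(label)
    return X, y


def centroid_predict(train_X, train_y, test_X, cols):
    """Stand-in classifier: assign each row to the nearest class centroid."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, t in zip(train_X, train_y) if t == label]
        centroids[label] = [sum(r[j] for r in rows) / len(rows) for j in cols]
    preds = []
    for x in test_X:
        dist = {
            label: sum((x[j] - c) ** 2 for j, c in zip(cols, centre))
            for label, centre in centroids.items()
        }
        preds.append(min(dist, key=dist.get))
    return preds


def fitness(mask, train_X, train_y, val_X, val_y):
    """Wrapper fitness: validation F-score using only the masked-in features."""
    cols = [j for j, keep in enumerate(mask) if keep]
    preds = centroid_predict(train_X, train_y, val_X, cols)
    return f_score(val_y, preds)


def wrapper_search(n_features, score, rng, n_iter=200):
    """Random search over binary feature masks (stand-in for a metaheuristic)."""
    best_mask = [True] * n_features  # start from the full feature set
    best_fit = score(best_mask)
    for _ in range(n_iter):
        mask = [rng.random() < 0.5 for _ in range(n_features)]
        if not any(mask):
            continue  # an empty subset cannot be scored
        fit = score(mask)
        # Prefer a higher F-score; on ties, prefer fewer features.
        if fit > best_fit or (fit == best_fit and sum(mask) < sum(best_mask)):
            best_mask, best_fit = mask, fit
    return best_mask, best_fit


rng = random.Random(0)
N_FEATURES = 9
train_X, train_y = make_data(80, N_FEATURES, rng)
val_X, val_y = make_data(40, N_FEATURES, rng)


def score(mask):
    return fitness(mask, train_X, train_y, val_X, val_y)


baseline = score([True] * N_FEATURES)
best_mask, best_fit = wrapper_search(N_FEATURES, score, rng)
selected = [j for j, keep in enumerate(best_mask) if keep]
print("all features F-score:", round(baseline, 4))
print("selected features:", selected, "F-score:", round(best_fit, 4))
```

Because the search starts from the full feature set and only accepts masks that score at least as well, the selected subset never does worse than the all-features baseline on the validation split. A real study would score subsets with cross-validation rather than a single split, since selecting features against one validation set gives an optimistic estimate.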
ISSN: 2196-1115
DOI: 10.1186/s40537-025-01132-7