An improved breast cancer disease prediction system using ML and PCA

Bibliographic Details
Published in Multimedia Tools and Applications Vol. 83; no. 11; pp. 33785-33821
Main Authors Laghmati, Sara; Hamida, Soufiane; Hicham, Khadija; Cherradi, Bouchaib; Tmiri, Amal
Format Journal Article
Language English
Published New York Springer US 01.03.2024
Springer Nature B.V
Summary: Computer-aided diagnosis (CAD) systems based on machine learning (ML) techniques have transformed medical research. Breast cancer classification is one of many areas where such models are deployed and where accuracy is the main concern. CAD systems aim to match the performance of trained clinicians in identifying breast cancer at its early stages, thereby improving patient outcomes while reducing the cost of treatment. This paper presents a supervised ML CAD system for breast cancer classification based on feature selection, principal component analysis (PCA), grid search for hyperparameter tuning, and cross-validation. The system draws on seven ML classifiers: ANN, k-NN, SVM, DT, RF, XGBoost, and AdaBoost. Two ensemble models were developed by combining the predictions of the individual ML models, one using majority voting and one using stacking with logistic regression (S-LR) for the final prediction. The system's performance is evaluated with several metrics, mainly accuracy, specificity, precision, recall, Matthews correlation coefficient, Jaccard index, and F1-score. The data sets used are the Wisconsin breast cancer dataset (WBCD) and the Mammographic Mass dataset. The results indicate that the XGBoost model achieved the highest recall, over 96%, on the Mammographic Mass dataset, while on the WBCD both the AdaBoost and S-LR models outperformed the others with a recall of 95.35%. The S-LR ensemble obtained the highest accuracies: 93.37% on the Mammographic Mass dataset and 97.37% on the WBCD. Accordingly, the proposed model can be suggested to assist decision-making in classifying breast cancer tumors, and a Flask application using the S-LR model was developed.
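The summary names majority voting and a set of confusion-matrix metrics without showing how they are computed. The following is a minimal sketch of both ideas in plain Python, not the authors' implementation. The function names `majority_vote` and `binary_metrics`, the three toy "models", and all label values are invented for illustration and do not come from the paper or its data sets.

```python
import math
from collections import Counter


def majority_vote(predictions):
    """Combine per-model label lists into one list by majority vote.

    `predictions` holds one equal-length label list per model. With an
    odd number of binary classifiers there are no ties.
    """
    return [Counter(votes).most_common(1)[0][0] for votes in zip(*predictions)]


def binary_metrics(y_true, y_pred, positive=1):
    """Compute the metrics listed in the summary from confusion counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    def ratio(num, den):
        # Return 0.0 instead of raising when a denominator is empty.
        return num / den if den else 0.0

    precision = ratio(tp, tp + fp)
    recall = ratio(tp, tp + fn)
    mcc_den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": ratio(tp + tn, len(pairs)),
        "specificity": ratio(tn, tn + fp),
        "precision": precision,
        "recall": recall,
        "f1": ratio(2 * precision * recall, precision + recall),
        "jaccard": ratio(tp, tp + fp + fn),
        "mcc": ratio(tp * tn - fp * fn, mcc_den),
    }


# Toy labels (1 = malignant, 0 = benign), invented for this example only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
model_a = [1, 1, 1, 0, 0, 0, 0, 1]
model_b = [1, 1, 0, 0, 0, 0, 1, 0]
model_c = [1, 0, 1, 1, 0, 0, 0, 0]

voted = majority_vote([model_a, model_b, model_c])
metrics = binary_metrics(y_true, voted)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

In this toy run the voted labels correct several individual errors, which is the motivation for ensembling. A stacking ensemble such as the S-LR model would instead feed the base predictions into a second-stage logistic regression, which this sketch does not attempt.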
ISSN: 1380-7501
1573-7721
DOI: 10.1007/s11042-023-16874-w