Predicting breast cancer survivability based on machine learning and features selection algorithms: a comparative study

Bibliographic Details
Published in Journal of ambient intelligence and humanized computing Vol. 12; no. 8; pp. 8585-8623
Main Author El_Rahman, Sahar A.
Format Journal Article
Language English
Published Berlin/Heidelberg Springer Berlin Heidelberg 01.08.2021
Springer Nature B.V
Summary: Breast cancer (BC) is considered the most common cause of cancer deaths in women. This study aims to identify BC early using machine learning algorithms and feature selection methods. The overall methodology is adapted from the knowledge data discovery (KDD) process and comprises four datasets, a preprocessing phase (data cleaning and splitting the data into training and testing sets), a processing phase (feature selection, k-fold validation, and classification), and finally model evaluation. This paper compares different classifiers, including decision tree (DT), random forest (RF), logistic regression (LR), Naïve Bayes (NB), K-nearest neighbor (KNN), and support vector machine (SVM). Four breast cancer datasets are used in the experiments: Wisconsin prognosis breast cancer (WPBC), Wisconsin diagnosis breast cancer (WDBC), Wisconsin Breast Cancer (WBC), and the Mammographic Mass Dataset (MM-Dataset) based on BI-RADS findings. The proposed models were evaluated using classification accuracy and the confusion matrix. On the WBC dataset, RF with the Genetic Algorithm (GA) as the feature selection method outperforms the other classifiers, with an accuracy of 96.82%. On the WDBC dataset, C-SVM with the radial basis function (RBF) kernel is superior to the other classifiers, with an accuracy of 99.04%. On the WPBC dataset, RF with recursive feature elimination (RFE) as the feature selection method outperforms the other classifiers, with an accuracy of 74.13%. On the MM-Dataset, DT outperforms the other classifiers, with an accuracy of 83.74%. The findings indicate that the proposed models are effective compared with other existing models.
ISSN: 1868-5137, 1868-5145
DOI: 10.1007/s12652-020-02590-y
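
The summary names k-fold validation, classification accuracy, and the confusion matrix as the evaluation tools. The following is a minimal, dependency-free sketch of those three ideas, not the author's implementation. The function names (`confusion_matrix`, `accuracy`, `k_fold_indices`) and the label lists are hypothetical and chosen only for illustration; the folds here are contiguous and unshuffled, which is an assumption rather than a detail taken from the paper.

```python
from collections import Counter


def confusion_matrix(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) counts for a binary classification task."""
    counts = Counter(
        (truth == positive, pred == positive)
        for truth, pred in zip(y_true, y_pred)
    )
    tp = counts[(True, True)]    # positive, predicted positive
    fp = counts[(False, True)]   # negative, predicted positive
    fn = counts[(True, False)]   # positive, predicted negative
    tn = counts[(False, False)]  # negative, predicted negative
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Fraction of all predictions that match the true label."""
    total = tp + fp + fn + tn
    return (tp + tn) / total if total else 0.0


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, unshuffled folds."""
    # Spread any remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


# Hypothetical labels (1 = malignant, 0 = benign); not data from the study.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]

tp, fp, fn, tn = confusion_matrix(y_true, y_pred)
acc = accuracy(tp, fp, fn, tn)
folds = list(k_fold_indices(len(y_true), k=4))
```

In a k-fold setup, a classifier would be trained on each `train_idx` subset and scored on the matching `test_idx` subset, with the per-fold accuracies averaged to give a single figure.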