Microarray Gene Expression Data Classification Via Wilcoxon Sign Rank Sum and Novel Grey Wolf Optimized Ensemble Learning Models

Cancer is a deadly disease that affects the lives of people all over the world. Finding a few genes relevant to a single cancer disease can lead to effective treatments. The difficulty with microarray datasets is their high dimensionality; they have a large number of features in comparison to the sm...

Full description

Saved in:

Bibliographic Details
Published in	IEEE/ACM transactions on computational biology and bioinformatics Vol. 20; no. 6; pp. 1 - 14
Main Authors	Saheed, Yakub K., Balogun, Bukola F., Odunayo, Braimah J., Mustapha, Abdulsalam
Format	Journal Article
Language	English
Published	United States IEEE 01.11.2023 The Institute of Electrical and Electronics Engineers, Inc. (IEEE)
Subjects	Algorithms Cancer Colon Colon cancer Colonic Neoplasms - genetics Datasets DNA microarrays Ensemble learning Feature extraction Feature selection Gene Expression Genes Humans Machine Learning Microarray Prediction algorithms Redundancy Tumors Wilcoxon Sign Rank Sum Xgboost
Online Access	Get full text

Cover

Loading…

More Information
Summary:	Cancer is a deadly disease that affects the lives of people all over the world. Finding a few genes relevant to a single cancer disease can lead to effective treatments. The difficulty with microarray datasets is their high dimensionality; they have a large number of features in comparison to the small number of samples in these datasets. Additionally, microarray data typically exhibit significant asymmetry in dimensionality as well as high levels of redundancy and noise. It is widely held that the majority of genes lack informative value about the classes under study. Recent research has attempted to reduce this high dimensionality by employing various feature selection techniques. This paper presents new ensemble feature selection techniques via the Wilcoxon Sign Rank Sum test (WCSRS) and the Fisher's test (F-test). In the first phase of the experiment, data preprocessing was performed; subsequently, feature selection was performed via the WCSRS and F-test in such a way that the (probability values) p-values of the WCRSR and F-test were adopted for cancerous gene identification. The extracted gene set was used to classify cancer patients using ensemble learning models (ELM), random forest (RF), extreme gradient boosting (Xgboost), cat boost, and Adaboost. To boost the performance of the ELM, we optimized the parameters of all the ELMs using the Grey Wolf optimizer (GWO). The experimental analysis was performed on colon cancer, which included 2000 genes from 62 patients (40 malignant and 22 benign). Using a WCSRS test for feature selection, the optimized Xgboost demonstrated 100% accuracy. The optimized cat boost, on the other hand, demonstrated 100% accuracy using the F-test for feature selection. This represents a 15% improvement over previously reported values in the literature.
Bibliography:	ObjectType-Article-1 SourceType-Scholarly Journals-1 ObjectType-Feature-2 content type line 23
ISSN:	1545-5963 1557-9964
DOI:	10.1109/TCBB.2023.3305429