Feature selection and computational optimization in high-dimensional microarray cancer datasets via InfoGain-modified bat algorithm

Bibliographic Details
Published in	Multimedia tools and applications Vol. 81; no. 25; pp. 36505 - 36549
Main Authors	Hambali, Moshood A., Oladele, Tinuke O., Adewole, Kayode S., Sangaiah, Arun Kumar, Gao, Wei
Format	Journal Article
Language	English
Published	New York: Springer US, 01.10.2022; Springer Nature B.V.
Subjects	1213: Computational Optimization and Applications for Heterogeneous Multimedia Data; Accuracy; Algorithms; Cancer; Classification; Computer Communication Networks; Computer Science; Data Structures and Information Theory; Datasets; Decision trees; Feature selection; Gene expression; Genes; Multimedia; Multimedia Information Systems; Optimization; Regression analysis; Special Purpose and Application-Based Systems; Computational optimization; Information gain; Microarray data; Binary bat algorithm; Random forest; Cancer classification
Summary:	Achieving satisfactory cancer classification accuracy with the complete set of genes remains a great challenge because gene expression data are high-dimensional, have small sample sizes, and contain noise. Feature reduction is a critical and sensitive step in the classification task, especially for heterogeneous multimedia data. A major difficulty in cancer studies is identifying the informative genes among the thousands available in microarray data. Traditional feature selection algorithms fail to scale to large feature spaces such as microarray data. An effective feature selection algorithm is therefore required to find the most significant subset of genes by removing non-predictive genes from the dataset without compromising the accuracy of the classification algorithm. The study proposes an Information Gain-Modified Bat Algorithm (InfoGain-MBA) feature selection model for selecting relevant and informative features from high-dimensional microarray cancer datasets and evaluates the approach with four classifiers: C4.5, Decision Tree, Random Forest, and classification and regression tree (CART). The results show that the proposed approach is promising for the classification of microarray cancer data. Random Forest achieved 100% accuracy with few genes on all seven datasets used. Further investigations were conducted to determine the optimal threshold for each dataset.
ISSN:	1380-7501 1573-7721
DOI:	10.1007/s11042-022-13532-5
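
Illustration:	The summary describes a two-stage pipeline: an information-gain filter with a per-dataset threshold, followed by a bat-algorithm search over the remaining genes. The record does not say how the authors modified the bat algorithm, so the following is only a minimal sketch under stated assumptions, not the paper's method. It assumes expression values have already been discretized, uses a generic binary bat search with a V-shaped transfer function, and takes a caller-supplied `fitness` function (in the paper's setting this would presumably be classifier accuracy). All function and parameter names are hypothetical.

```python
import math
import random
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(column, labels):
    """Information gain of one discrete feature column w.r.t. the labels."""
    n = len(labels)
    groups = {}
    for value, label in zip(column, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def filter_by_info_gain(X, y, threshold):
    """Indices of features whose information gain exceeds the threshold."""
    n_features = len(X[0])
    gains = [info_gain([row[j] for row in X], y) for j in range(n_features)]
    return [j for j, g in enumerate(gains) if g > threshold]


def binary_bat_search(n_bits, fitness, n_bats=10, n_iter=30,
                      loudness=0.9, pulse_rate=0.5, seed=0):
    """Generic binary bat search (an assumption, not the paper's variant).

    Each bat is a 0/1 mask over features; higher fitness is better.
    Loudness and pulse rate are held constant to keep the sketch short.
    """
    rng = random.Random(seed)
    bats = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(n_bats)]
    vel = [[0.0] * n_bits for _ in range(n_bats)]
    scores = [fitness(b) for b in bats]
    best_i = max(range(n_bats), key=scores.__getitem__)
    best, best_score = bats[best_i][:], scores[best_i]
    for _ in range(n_iter):
        for i in range(n_bats):
            freq = rng.random()  # pulse frequency drawn from [0, 1)
            cand = bats[i][:]
            for d in range(n_bits):
                vel[i][d] += (bats[i][d] - best[d]) * freq
                # V-shaped transfer function: velocity -> flip probability
                p_flip = abs(2 / math.pi * math.atan(math.pi / 2 * vel[i][d]))
                if rng.random() < p_flip:
                    cand[d] = 1 - cand[d]
                # Local search step: sometimes copy the bit from the best bat
                if rng.random() > pulse_rate:
                    cand[d] = best[d]
            score = fitness(cand)
            # Accept a non-worse candidate with probability `loudness`
            if score >= scores[i] and rng.random() < loudness:
                bats[i], scores[i] = cand, score
            if score > best_score:
                best, best_score = cand[:], score
    return best, best_score
```

In use, `filter_by_info_gain` would first shrink the gene set, and `binary_bat_search` would then be run with `n_bits` equal to the number of surviving genes and a `fitness` that scores a mask, for example by cross-validated accuracy of a chosen classifier on the selected columns.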